https://github.com/abrie/custom-web-scraper
A One-off web scraper.
- Host: GitHub
- URL: https://github.com/abrie/custom-web-scraper
- Owner: abrie
- Created: 2020-01-26T14:43:20.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-26T17:12:49.000Z (over 6 years ago)
- Last Synced: 2025-03-18T21:23:41.079Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Custom Web Scraper
This is a one-off project to scrape data from the web, built for a hospitality and entertainment product. It is documented here for posterity and discussion. Specific URLs and proprietary details are excluded from this repository.
## Technologies
It uses a mix of technologies, selected for expedience and utility:
Make, Bash, [cURL](https://curl.haxx.se/), [awk](https://www.gnu.org/software/gawk/manual/gawk.html), Python3, [jq](https://stedolan.github.io/jq/).
## Overview
The scraper runs in a series of stages. Each stage takes an input and generates an output, and each output is cached on the filesystem. The stages are invoked through a `Makefile`.
| Stage | Input      | Action                                                                                        | Output                                           |
| ----- | ---------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| 1     | **secret** | [Bash scripts cURL to query a list of URLs](step1.sh)                                         | A list of indexed 'location' headers             |
| 2     | stage 1    | [Awk extracts URL from location header](Makefile#L14)                                         | A list of indexed URLs                           |
| 3     | stage 2    | [Python iterates through list and caches URL content](step3.py)                               | Directory of .gz files named by index value      |
| 4     | stage 3    | [Python iterates through cached .gz files and applies regex for fields of interest](step4.py) | Directory of JSON files named by index           |
| 5     | stage 4    | Bash and jq filter JSON files according to tuned selection criteria                           | A file with a list of indexes relevant to search |
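
As a rough illustration of stages 3 and 4, the sketch below caches each page as `<index>.gz` and then applies regexes to the cached files to produce `<index>.json`. It is not the code in `step3.py` or `step4.py`: the function names, the `fetch` callable, and the `title` and `price` patterns are all hypothetical, since the real URLs and fields are excluded from this repository.

```python
import gzip
import json
import re
from pathlib import Path
from typing import Callable, Dict

# Hypothetical field patterns; the real fields of interest are not published.
FIELD_PATTERNS: Dict[str, re.Pattern] = {
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
}


def cache_pages(urls: Dict[str, str], cache_dir: Path,
                fetch: Callable[[str], bytes]) -> None:
    """Stage 3 sketch: store each page as <index>.gz, skipping cached ones."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for index, url in urls.items():
        target = cache_dir / f"{index}.gz"
        if target.exists():
            # Already fetched by an earlier run; do not hit the network again.
            continue
        target.write_bytes(gzip.compress(fetch(url)))


def extract_fields(cache_dir: Path, json_dir: Path) -> None:
    """Stage 4 sketch: run each regex over a cached page, write <index>.json."""
    json_dir.mkdir(parents=True, exist_ok=True)
    for gz_path in sorted(cache_dir.glob("*.gz")):
        raw = gzip.decompress(gz_path.read_bytes())
        text = raw.decode("utf-8", errors="replace")
        record = {}
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(text)
            # A field that does not match is recorded as null in the JSON.
            record[name] = match.group(1) if match else None
        (json_dir / f"{gz_path.stem}.json").write_text(json.dumps(record))
```

Passing `fetch` as a parameter keeps the sketch runnable offline; in practice it could be something like `lambda url: urllib.request.urlopen(url).read()`. Because cached files are skipped, re-running stage 3 after tuning the stage 4 regexes does not re-download anything, which is the point of caching each stage's output on the filesystem.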