https://github.com/abrie/custom-web-scraper

A one-off web scraper.

Host: GitHub
URL: https://github.com/abrie/custom-web-scraper
Owner: abrie
Created: 2020-01-26T14:43:20.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-01-26T17:12:49.000Z (over 6 years ago)
Last Synced: 2025-03-18T21:23:41.079Z (about 1 year ago)
Language: Python
Homepage:
Size: 5.86 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Custom Web Scraper

This is a one-off project to scrape data from the web, built for a hospitality and entertainment product. It is documented here for posterity and discussion. Specific URLs and proprietary details are excluded from this repository.

## Technologies

It uses a mix of technologies, selected for expedience and utility:

Make, Bash, [cURL](https://curl.haxx.se/), [awk](https://www.gnu.org/software/gawk/manual/gawk.html), Python3, [jq](https://stedolan.github.io/jq/).

## Overview

The scraper runs in a series of stages. Each stage takes an input and generates an output. Outputs are cached on the filesystem. The stages are invoked through a `Makefile`:

| Stage | Input      | Action                                                                                        | Output                                            |
| ----- | ---------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------- |
| 1     | **secret** | [Bash script uses cURL to query a list of URLs](step1.sh)                                     | A list of indexed 'location' headers              |
| 2     | stage 1    | [Awk extracts the URL from each location header](Makefile#L14)                                | A list of indexed URLs                            |
| 3     | stage 2    | [Python iterates through the list and caches each URL's content](step3.py)                    | Directory of .gz files named by index value       |
| 4     | stage 3    | [Python iterates through cached .gz files and applies regex for fields of interest](step4.py) | Directory of JSON files named by index            |
| 5     | stage 4    | Bash and jq filter the JSON files according to tuned selection criteria                       | A file listing the indexes relevant to the search |
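
The data flow through stages 2 to 4 can be illustrated with a minimal Python sketch. This is not the code in `step3.py` or `step4.py`. The line format `<index> location: <url>`, the `FIELD_PATTERNS` regexes, and all function names are assumptions made for illustration, since the real URLs and patterns are excluded from the repository. Stage 2 is done with awk in the project and is shown in Python here only to keep the example in one language. Fetching is left out, so the page body is passed in as bytes.

```python
import gzip
import json
import re
from pathlib import Path

# Hypothetical field patterns; the project's real regexes are not published.
FIELD_PATTERNS = {
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
}


def parse_location_line(line):
    """Stage 2 analogue: split '<index> location: <url>' into (index, url).

    The input format is an assumption about what stage 1 emits.
    """
    index, _, header = line.strip().partition(" ")
    name, _, url = header.partition(":")  # first colon only, so the URL stays intact
    if name.strip().lower() != "location":
        raise ValueError(f"not a location header: {line!r}")
    return index, url.strip()


def cache_page(index, body, cache_dir):
    """Stage 3 analogue: store fetched bytes as <index>.gz, reusing existing files."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{index}.gz"
    if not target.exists():  # a cached output is not rewritten on later runs
        with gzip.open(target, "wb") as fh:
            fh.write(body)
    return target


def extract_fields(html):
    """Apply each pattern and keep the first match, or None when absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        result[name] = match.group(1).strip() if match else None
    return result


def extract_all(cache_dir, json_dir):
    """Stage 4 analogue: write one <index>.json per cached <index>.gz."""
    json_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(cache_dir.glob("*.gz")):
        with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as fh:
            fields = extract_fields(fh.read())
        out_path = json_dir / f"{gz_path.stem}.json"
        out_path.write_text(json.dumps(fields, indent=2), encoding="utf-8")
        written.append(out_path)
    return written
```

Because each stage reads only the files written by the previous one, a stage can be rerun on its own without repeating the network requests, which is the property the filesystem cache provides.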
