https://github.com/lalelisealstad/dataengineer-job-scraper-etl

Data pipeline that scrapes Data Engineer job postings from London, UK, using GCP tools and extracts skill-related keywords using spaCy. The goal is to analyse in-demand skills for data engineering positions.

Host: GitHub
URL: https://github.com/lalelisealstad/dataengineer-job-scraper-etl
Owner: lalelisealstad
License: mit
Created: 2024-07-13T10:18:03.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-03-09T19:10:23.000Z (4 months ago)
Last Synced: 2025-03-09T20:20:00.980Z (4 months ago)
Language: Python
Homepage:
Size: 94.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Data Engineer Job Scraper ETL

This repository contains the source code for a data pipeline that scrapes Data Engineer, Data Scientist and Data Analyst job postings every night using Google Cloud Platform (GCP) tools: Cloud Scheduler, Pub/Sub, Cloud Functions, and Cloud Storage. The program collects job descriptions for positions in London, UK posted in the last 24 hours and uses the spaCy NLP package to extract the words that describe the skills required for each position. The purpose of the pipeline is to analyze in-demand skills for data jobs. A dashboard that visualizes the results is available [here](https://job-dashboard-qytxiv2xfq-lz.a.run.app/).

Dataflow diagram:
![ETL process dataflow diagram](etl_process.png)
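
To illustrate the skill extraction and counting idea, here is a minimal sketch using only the standard library. It is a simplified stand-in, not the project's code: the pipeline itself uses spaCy, while this sketch does plain token matching. The sample patterns, the assumed JSONL layout (one `{"label": ..., "pattern": [{"LOWER": ...}]}` object per line), and the names `load_phrases` and `extract_skills` are all hypothetical.

```python
import json
import re
from collections import Counter

# Hypothetical sample patterns, assumed to follow a JSONL layout with one
# pattern per line and lowercase token matching via the "LOWER" key.
SAMPLE_PATTERNS = """\
{"label": "SKILL", "pattern": [{"LOWER": "python"}]}
{"label": "SKILL", "pattern": [{"LOWER": "sql"}]}
{"label": "SKILL", "pattern": [{"LOWER": "apache"}, {"LOWER": "spark"}]}
"""


def load_phrases(jsonl_text):
    """Turn each JSONL pattern into a tuple of lowercase tokens."""
    phrases = set()
    for line in jsonl_text.splitlines():
        if line.strip():
            tokens = json.loads(line)["pattern"]
            phrases.add(tuple(tok["LOWER"] for tok in tokens))
    return phrases


def extract_skills(description, phrases):
    """Return the set of skill phrases found in one job description."""
    # Naive tokenizer: lowercase runs of letters, digits, '+' and '#'.
    words = re.findall(r"[a-z0-9+#]+", description.lower())
    found = set()
    for phrase in phrases:
        n = len(phrase)
        # Slide a window of the phrase length over the token list.
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                found.add(" ".join(phrase))
                break
    return found


# Made-up descriptions, for illustration only.
descriptions = [
    "We need Python, SQL and Apache Spark experience.",
    "Strong SQL skills required; Python is a plus.",
]

phrases = load_phrases(SAMPLE_PATTERNS)
skill_counts = Counter()
for text in descriptions:
    # Each skill is counted at most once per posting.
    skill_counts.update(extract_skills(text, phrases))

print(skill_counts.most_common())
```

Counting each skill once per posting keeps a description that repeats a keyword from inflating its demand figure.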

#### Deployment Notes:
The program is deployed to Cloud Functions using GitHub Actions. It is triggered daily by Pub/Sub messages published by Cloud Scheduler, each carrying a search term (Data Engineer, Data Scientist or Data Analyst).
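
As a rough sketch of how the function could read the search term from the trigger: Pub/Sub-triggered background functions receive the message body base64-encoded under the event's `data` key. The helper name `search_term_from_event` and the assumption that the body is the plain-text search term are hypothetical; the actual message format used in this repository may differ.

```python
import base64


def search_term_from_event(event: dict) -> str:
    """Decode the search term from a Pub/Sub-triggered function event.

    Assumption: the message body is the plain-text search term,
    for example "Data Engineer".
    """
    raw = event.get("data")
    if not raw:
        raise ValueError("Pub/Sub event has no data payload")
    return base64.b64decode(raw).decode("utf-8").strip()


# Simulate a message as Cloud Scheduler might publish it (hypothetical payload).
fake_event = {"data": base64.b64encode(b"Data Engineer").decode("ascii")}
print(search_term_from_event(fake_event))
```

Building the event by hand like this makes it possible to test the decoding locally without deploying.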

#### Development Notes:
To run the program for the first time:
```
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
```

Run the program after installation:
```
$ source .venv/bin/activate
$ python "main.py"
```

### Resources
I used this approach for scraping LinkedIn data:
https://medium.com/@alaeddine.grine/linkedin-job-scraper-and-matcher-85d0308ef9aa

The skill patterns file comes from:
https://raw.githubusercontent.com/kingabzpro/jobzilla_ai/main/jz_skill_patterns.jsonl
