Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lalelisealstad/dataengineer-job-scraper-etl
Data pipeline that scrapes Data Engineer job postings from London, UK, using GCP tools and extracts skill-related keywords using spaCy. The goal is to analyse in-demand skills for data engineering positions.
Last synced: 23 days ago
- Host: GitHub
- URL: https://github.com/lalelisealstad/dataengineer-job-scraper-etl
- Owner: lalelisealstad
- License: mit
- Created: 2024-07-13T10:18:03.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-15T13:28:49.000Z (4 months ago)
- Last Synced: 2024-08-15T15:21:56.731Z (4 months ago)
- Language: Python
- Homepage:
- Size: 73.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Data Engineer Job Scraper ETL
This repository contains the source code for a data pipeline that automatically scrapes Data Engineer, Data Scientist and Data Analyst job postings every night using Google Cloud Platform (GCP) tools: Cloud Scheduler, Pub/Sub, Cloud Functions, and Cloud Storage. The program collects job descriptions for positions in London, UK posted in the last 24 hours and uses the spaCy NLP package to extract words describing "skills" needed for the positions. The purpose of the pipeline is to analyze in-demand skills for data jobs. A dashboard to visualize the results will be created in another repository.
#### Deployment Notes:
The program is deployed to Cloud Functions using GitHub Actions. It is triggered daily by Pub/Sub messages from Cloud Scheduler that contain the search term (Data Engineer, Data Scientist, or Data Analyst).

#### Development Notes:
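When developing locally, it can help to exercise the Pub/Sub trigger path without deploying anything. The sketch below is a minimal illustration and not code from this repository. It assumes a background-style function where the message body arrives base64-encoded under `event["data"]`, and it assumes the search term is sent as plain UTF-8 text. The names `parse_search_term`, `handle_event`, and `make_test_event` are hypothetical.

```python
import base64


def parse_search_term(event: dict) -> str:
    """Decode the search term from a Pub/Sub-style event payload.

    Assumption: the term is plain UTF-8 text, base64-encoded in event["data"].
    """
    raw = event.get("data", "")
    return base64.b64decode(raw).decode("utf-8").strip()


def handle_event(event: dict, context=None) -> str:
    """Hypothetical entry point: read the search term and hand it to the scraper."""
    search_term = parse_search_term(event)
    # A real entry point would start the scrape for `search_term` here.
    return search_term


def make_test_event(search_term: str) -> dict:
    """Build a fake event shaped like the payload assumed above, for local runs."""
    encoded = base64.b64encode(search_term.encode("utf-8")).decode("ascii")
    return {"data": encoded}


if __name__ == "__main__":
    for term in ("Data Engineer", "Data Scientist", "Data Analyst"):
        print(handle_event(make_test_event(term)))
```

Keeping the decoding in its own small function means it can be tested without touching GCP.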
To run the program for the first time:
```
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
```

Run the program after installation:
```
$ source .venv/bin/activate
$ python "main.py"
```

### Resources
I used this approach for scraping LinkedIn data:
https://medium.com/@alaeddine.grine/linkedin-job-scraper-and-matcher-85d0308ef9aa

Used skills file from:
https://raw.githubusercontent.com/kingabzpro/jobzilla_ai/main/jz_skill_patterns.jsonl
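A skills file like this can be used with spaCy's rule-based `entity_ruler` component. The sketch below shows the general idea and is not this repository's implementation. It assumes the patterns follow the token-pattern format with a `SKILL` label, and it uses two inline example patterns and a blank English pipeline (instead of `en_core_web_lg`) so that it runs without downloading a model. The function names are hypothetical.

```python
import spacy

# Two illustrative patterns in the entity_ruler format (assumed, not copied from the file).
SKILL_PATTERNS = [
    {"label": "SKILL", "pattern": [{"LOWER": "python"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "apache"}, {"LOWER": "spark"}]},
]


def build_skill_pipeline(patterns):
    """Create a tokenizer-only pipeline with a rule-based skill matcher."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return nlp


def extract_skills(nlp, text):
    """Return the distinct skill mentions found in a job description, lowercased and sorted."""
    doc = nlp(text)
    return sorted({ent.text.lower() for ent in doc.ents if ent.label_ == "SKILL"})


nlp = build_skill_pipeline(SKILL_PATTERNS)
skills = extract_skills(nlp, "We need Python and Apache Spark experience.")
print(skills)
```

With a downloaded copy of the patterns file, `ruler.from_disk(path)` could replace the inline list, provided the file has a `.jsonl` suffix and uses this format.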