Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dodat-12/big-data-with-gcp
Experimenting GCP for Big Data Project
- Host: GitHub
- URL: https://github.com/dodat-12/big-data-with-gcp
- Owner: DoDat-12
- Created: 2024-10-17T08:07:17.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-12T03:42:09.000Z (2 months ago)
- Last Synced: 2024-11-13T22:36:30.036Z (2 months ago)
- Language: Python
- Homepage:
- Size: 490 KB
- Stars: 0
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Experimenting GCP for Big Data Project
![Pipeline architecture](docs/pipeline.png)
## Prerequisites
- A Google Cloud Platform account with billing enabled (skip if already set up)
- Create a service account with Owner access (skip if already created)
  - id: `tadod-sa-434`
  - Go to Manage keys and create a `serviceKeyGoogle.json` key, stored in this directory (add it to `.gitignore`)
- Enable the following APIs (skip if already enabled)
  - Compute Engine API
  - Cloud Dataproc API
  - Cloud Resource Manager API
- Set up a virtual environment

  ```shell
  py -m venv env
  ./env/Scripts/activate
  ```

- Install the Python libraries

  ```shell
  pip install -r requirements.txt
  ```

- Create a Service Account with the Owner role, create a key, and save it as `serviceKeyGoogle.json`
- Test run

  ```shell
  py setup_test.py
  ```
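The contents of `setup_test.py` are not reproduced here; as a rough idea of what such an authentication check involves, the sketch below (an assumption, not the actual script) loads the service account key and lists the project's GCS buckets with the `google-cloud-storage` client.

```python
# Minimal authentication-check sketch (the real setup_test.py may differ):
# load the service account key and list the buckets visible to it.
from google.cloud import storage
from google.oauth2 import service_account

KEY_PATH = "serviceKeyGoogle.json"   # key created in the steps above
PROJECT_ID = "uber-analysis-439804"  # project ID from the Project Information section below

credentials = service_account.Credentials.from_service_account_file(KEY_PATH)
client = storage.Client(project=PROJECT_ID, credentials=credentials)

# If authentication works, this prints the buckets the service account can see.
for bucket in client.list_buckets():
    print(bucket.name)
print("Authentication OK")
```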
## Project Structure
- `gcs`
  - `bucket.py` - function to create a bucket on Google Cloud Storage
  - `load_data.py` - functions to download data and upload it to a bucket on GCS (see the sketch after this list)
  - `main.py` - execution file
- `dataproc`
  - `cluster.py` - functions to manage the Dataproc cluster (create, update, delete, start, stop, submit job)
  - `jobs` - contains the PySpark jobs
    - `wh_init.py` - initializes the data warehouse on Google BigQuery (year: 2011)
    - `wh_batch_load.py` - batch processing for each year from 2012 to the present
  - `main.py` - execution file
- `bigquery`
- `docs` - files for README.md
- `setup_test.py` - checks authentication from the local machine to GCP
- `serviceKeyGoogle.json` - Service Account key for authentication
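A minimal sketch of the kind of helpers `gcs/bucket.py` and `gcs/load_data.py` could provide, assuming the `google-cloud-storage` client; the function names, signatures, and the local file path are illustrative, not the repository's actual API.

```python
# Illustrative helpers in the spirit of gcs/bucket.py and gcs/load_data.py;
# names and signatures are assumptions, not the repository's actual code.
from google.cloud import storage


def create_bucket(client: storage.Client, bucket_name: str, location: str = "us-central1") -> storage.Bucket:
    """Create a GCS bucket (e.g. uber-{year}-154055) if it does not already exist."""
    bucket = client.bucket(bucket_name)
    if not bucket.exists():
        bucket = client.create_bucket(bucket, location=location)
    return bucket


def upload_file(client: storage.Client, bucket_name: str, local_path: str, blob_name: str) -> None:
    """Upload a local file to the given bucket under blob_name."""
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)


if __name__ == "__main__":
    client = storage.Client.from_service_account_json("serviceKeyGoogle.json")
    create_bucket(client, "uber-2011-154055")
    # Local path is hypothetical; the real script downloads the raw data first.
    upload_file(client, "uber-2011-154055", "data/tripdata_2011.parquet", "raw/tripdata_2011.parquet")
```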
## Project Information

- Project ID: `uber-analysis-439804`
- Region: `us-central1`
- Dataproc cluster name: `uber-hadoop-spark-cluster`
- Bucket names:
  - Raw data: `uber-{year}-154055`
  - PySpark jobs and tmp dir: `uber-pyspark-jobs`

> pip freeze > requirements.txt
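As a rough illustration of how the identifiers above fit together, the sketch below creates the Dataproc cluster and submits a PySpark job with the `google-cloud-dataproc` client. It is an assumption about the approach taken in `dataproc/cluster.py` (machine types, the job file path, and function names are illustrative), not a copy of the repository's code.

```python
# Illustrative use of the project identifiers with the google-cloud-dataproc client;
# an assumption about dataproc/cluster.py, not the repository's actual code.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at serviceKeyGoogle.json (see Prerequisites).
from google.cloud import dataproc_v1

PROJECT_ID = "uber-analysis-439804"
REGION = "us-central1"
CLUSTER_NAME = "uber-hadoop-spark-cluster"
JOBS_BUCKET = "uber-pyspark-jobs"

ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}


def create_cluster() -> None:
    """Create a small Dataproc cluster (machine sizes here are illustrative)."""
    client = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT)
    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": CLUSTER_NAME,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready


def submit_pyspark_job(main_file: str) -> None:
    """Submit a PySpark job (e.g. gs://uber-pyspark-jobs/wh_init.py) to the cluster."""
    client = dataproc_v1.JobControllerClient(client_options=ENDPOINT)
    job = {
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": main_file},
    }
    operation = client.submit_job_as_operation(
        request={"project_id": PROJECT_ID, "region": REGION, "job": job}
    )
    print(operation.result().reference.job_id)


if __name__ == "__main__":
    create_cluster()
    submit_pyspark_job(f"gs://{JOBS_BUCKET}/wh_init.py")
```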