Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dodat-12/big-data-with-gcp
Experimenting GCP for Big Data Project
- Host: GitHub
- URL: https://github.com/dodat-12/big-data-with-gcp
- Owner: DoDat-12
- Created: 2024-10-17T08:07:17.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-12T03:42:09.000Z (2 months ago)
- Last Synced: 2024-11-13T22:36:30.036Z (2 months ago)
- Language: Python
- Homepage:
- Size: 490 KB
- Stars: 0
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Experimenting GCP for Big Data Project
![Pipeline architecture](docs/pipeline.png)
## Prerequisites
- A Google Cloud Platform account with billing enabled (skip if already set up)
- Create a service account with Owner access (skip if already created)
  - id: `tadod-sa-434`
  - Go to Manage keys and create a `serviceKeyGoogle.json` key, stored in this directory (add it to `.gitignore`)
- Enable the following APIs (skip if already enabled)
  - Compute Engine API
  - Cloud Dataproc API
  - Cloud Resource Manager API
- Set up a virtual environment

  ```shell
  py -m venv env
  ./env/Scripts/activate
  ```

- Install the Python libraries

  ```shell
  pip install -r requirements.txt
  ```

- Create a Service Account with the Owner role, create a key, and save it as `serviceKeyGoogle.json`
- Test run

  ```shell
  py setup_test.py
  ```
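The contents of `setup_test.py` are not reproduced here; as a rough idea of what such an authentication check involves, the sketch below (an assumption, not the actual script) loads the service account key and lists the project's GCS buckets with the `google-cloud-storage` client.

```python
# Minimal authentication-check sketch (the real setup_test.py may differ):
# load the service account key and list the buckets visible to it.
from google.cloud import storage
from google.oauth2 import service_account

KEY_PATH = "serviceKeyGoogle.json"   # key created in the steps above
PROJECT_ID = "uber-analysis-439804"  # project ID from the Project Information section below

credentials = service_account.Credentials.from_service_account_file(KEY_PATH)
client = storage.Client(project=PROJECT_ID, credentials=credentials)

# If authentication works, this prints the buckets the service account can see.
for bucket in client.list_buckets():
    print(bucket.name)
print("Authentication OK")
```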
## Project Structure
- `gcs`
  - `bucket.py` - function to create a bucket on Google Cloud Storage
  - `load_data.py` - functions to download data and upload it to a bucket on GCS (see the sketch after this list)
  - `main.py` - execution file
- `dataproc`
  - `cluster.py` - functions to manage the Dataproc cluster (create, update, delete, start, stop, submit job)
  - `jobs` - contains the PySpark jobs
    - `wh_init.py` - initializes the data warehouse on Google BigQuery (year: 2011)
    - `wh_batch_load.py` - batch processing for each year from 2012 to the present
  - `main.py` - execution file
- `bigquery`
- `docs` - files for README.md
- `setup_test.py` - checks authentication from the local machine to GCP
- `serviceKeyGoogle.json` - Service Account key for authentication
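A minimal sketch of the kind of helpers `gcs/bucket.py` and `gcs/load_data.py` could provide, assuming the `google-cloud-storage` client; the function names, signatures, and the local file path are illustrative, not the repository's actual API.

```python
# Illustrative helpers in the spirit of gcs/bucket.py and gcs/load_data.py;
# names and signatures are assumptions, not the repository's actual code.
from google.cloud import storage


def create_bucket(client: storage.Client, bucket_name: str, location: str = "us-central1") -> storage.Bucket:
    """Create a GCS bucket (e.g. uber-{year}-154055) if it does not already exist."""
    bucket = client.bucket(bucket_name)
    if not bucket.exists():
        bucket = client.create_bucket(bucket, location=location)
    return bucket


def upload_file(client: storage.Client, bucket_name: str, local_path: str, blob_name: str) -> None:
    """Upload a local file to the given bucket under blob_name."""
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)


if __name__ == "__main__":
    client = storage.Client.from_service_account_json("serviceKeyGoogle.json")
    create_bucket(client, "uber-2011-154055")
    # Local path is hypothetical; the real script downloads the raw data first.
    upload_file(client, "uber-2011-154055", "data/tripdata_2011.parquet", "raw/tripdata_2011.parquet")
```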
## Project Information

- Project ID: `uber-analysis-439804`
- Region: `us-central1`
- Dataproc cluster name: `uber-hadoop-spark-cluster`
- Bucket names:
  - Raw data: `uber-{year}-154055`
  - PySpark jobs and tmp dir: `uber-pyspark-jobs`

> pip freeze > requirements.txt
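As a rough illustration of how the identifiers above fit together, the sketch below creates the Dataproc cluster and submits a PySpark job with the `google-cloud-dataproc` client. It is an assumption about the approach taken in `dataproc/cluster.py` (machine types, the job file path, and function names are illustrative), not a copy of the repository's code.

```python
# Illustrative use of the project identifiers with the google-cloud-dataproc client;
# an assumption about dataproc/cluster.py, not the repository's actual code.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at serviceKeyGoogle.json (see Prerequisites).
from google.cloud import dataproc_v1

PROJECT_ID = "uber-analysis-439804"
REGION = "us-central1"
CLUSTER_NAME = "uber-hadoop-spark-cluster"
JOBS_BUCKET = "uber-pyspark-jobs"

ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}


def create_cluster() -> None:
    """Create a small Dataproc cluster (machine sizes here are illustrative)."""
    client = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT)
    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": CLUSTER_NAME,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready


def submit_pyspark_job(main_file: str) -> None:
    """Submit a PySpark job (e.g. gs://uber-pyspark-jobs/wh_init.py) to the cluster."""
    client = dataproc_v1.JobControllerClient(client_options=ENDPOINT)
    job = {
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": main_file},
    }
    operation = client.submit_job_as_operation(
        request={"project_id": PROJECT_ID, "region": REGION, "job": job}
    )
    print(operation.result().reference.job_id)


if __name__ == "__main__":
    create_cluster()
    submit_pyspark_job(f"gs://{JOBS_BUCKET}/wh_init.py")
```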