https://github.com/danieldacosta/spark-etl
ETL pipeline in Spark that loads data from S3, processes the data into analytics tables, and loads them back to S3.
- Host: GitHub
- URL: https://github.com/danieldacosta/spark-etl
- Owner: DanielDaCosta
- License: apache-2.0
- Created: 2020-11-04T01:00:50.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2020-12-12T17:52:50.000Z (over 5 years ago)
- Last Synced: 2025-01-11T01:10:37.309Z (over 1 year ago)
- Language: Python
- Size: 785 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Building an ETL Pipeline using Spark
An ETL pipeline using Spark that loads data from S3, processes the data into analytics tables, and loads them back into S3.
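The extract, transform, and load steps described above can be sketched roughly as follows. This is a minimal illustration and not the code in this repo: the JSON input format, the Parquet output format, the column names (`user_id`, `first_name`), the function names, and the bucket paths are all hypothetical placeholders.

```python
def run_etl(spark, input_uri, output_uri):
    """Read raw data from S3, build one analytics table, write it back to S3.

    `spark` is an existing SparkSession. The input format (JSON), output
    format (Parquet), and column names are assumptions for this sketch.
    """
    # Extract: load the raw records from the input location.
    raw = spark.read.json(input_uri)

    # Transform: keep a few columns and one row per user (hypothetical schema).
    table = raw.select("user_id", "first_name").dropDuplicates(["user_id"])

    # Load: write the analytics table to the output location.
    table.write.mode("overwrite").parquet(output_uri)
    return table


def main():
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-etl-sketch").getOrCreate()
    # Placeholder bucket names; replace with real locations.
    run_etl(spark, "s3://example-input-bucket/raw/", "s3://example-output-bucket/users/")
    spark.stop()
```

A script built this way would call `main()` at the bottom and be submitted to the cluster as a PySpark job. Keeping the session creation outside `run_etl` makes the transformation easier to exercise with a local session.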
# AWS Credentials
This repo reads AWS credentials from environment variables:
```bash
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_DEFAULT_REGION=
```
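Since the job depends on these three variables, it can help to fail early when one is unset. The check below is a small sketch using only the standard library; the helper name `load_aws_env` is hypothetical and is not part of this repo.

```python
import os

# The three variables exported above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def load_aws_env(environ=None):
    """Return the AWS settings as a dict, or raise if any are missing or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing AWS environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_aws_env()` at the start of a script reports every missing variable by name before any S3 access is attempted.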
# EMR Cluster and PySpark Job
The EMR cluster was set up through the AWS console. For a walkthrough, see this [tutorial](https://www.youtube.com/watch?v=gOT7El8rMws) or [this one](https://www.youtube.com/watch?v=r-ig8zpP3EM).
# References
- https://www.youtube.com/watch?v=gOT7El8rMws
- https://www.youtube.com/watch?v=r-ig8zpP3EM