https://github.com/danieldacosta/spark-etl

ETL pipeline in Spark that loads data from s3, processes the data into analytics tables, and loads them back to s3.

Host: GitHub
URL: https://github.com/danieldacosta/spark-etl
Owner: DanielDaCosta
License: apache-2.0
Created: 2020-11-04T01:00:50.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2020-12-12T17:52:50.000Z (over 5 years ago)
Last Synced: 2025-01-11T01:10:37.309Z (over 1 year ago)
Language: Python
Size: 785 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Building an ETL Pipeline Using Spark
An ETL pipeline built with Spark that loads data from S3, processes it into analytics tables, and writes those tables back to S3.
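
As a rough illustration of the load, transform, and write-back flow described above, here is a minimal PySpark sketch. The bucket layout, the column names (`user_id`, `first_name`, `last_name`), the `s3_path` helper, and the `s3a://` scheme are all hypothetical placeholders and not taken from this repo's code. On EMR, plain `s3://` URIs are typically used instead.

```python
def s3_path(bucket: str, *parts: str) -> str:
    """Build an s3a:// URI from a bucket name and key parts (hypothetical helper)."""
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3a://{bucket}/{key}" if key else f"s3a://{bucket}"


def run_etl(input_path: str, output_path: str) -> None:
    """Read raw JSON from S3, derive one analytics table, write it back as Parquet."""
    # Imported lazily so the path helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-etl-sketch").getOrCreate()

    # Extract: load raw records (assumed here to be JSON).
    raw = spark.read.json(input_path)

    # Transform: keep a few assumed columns and keep one row per user_id.
    users = raw.select("user_id", "first_name", "last_name").dropDuplicates(["user_id"])

    # Load: write the analytics table back to S3, replacing any previous run.
    users.write.mode("overwrite").parquet(output_path)

    spark.stop()
```

A call might look like `run_etl(s3_path("my-input-bucket", "events"), s3_path("my-output-bucket", "analytics", "users"))`, where both bucket names are made up for the example.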

# AWS Credentials

This repo reads AWS credentials from environment variables:
```bash
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_DEFAULT_REGION=
```
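
Before starting a job, it can help to fail fast when one of these variables is unset. The following is a small sketch of such a check. The `load_aws_env` function and `REQUIRED_VARS` tuple are hypothetical names introduced for illustration and are not part of this repo.

```python
import os

# The three variables exported above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def load_aws_env(env=None) -> dict:
    """Return the AWS settings from the environment, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing AWS environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_aws_env()` with no arguments checks the current process environment. Passing a dictionary is convenient for testing.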

# EMR Cluster and PySpark Job

The EMR cluster was set up through the AWS console. For a walkthrough, see [this tutorial](https://www.youtube.com/watch?v=gOT7El8rMws) or [this one](https://www.youtube.com/watch?v=r-ig8zpP3EM).
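
The repo itself relies on the console, but a PySpark job can also be submitted to a running cluster programmatically. The sketch below shows one possible way using boto3's `add_job_flow_steps`. The function names, the step name, and the script location are assumptions made for the example and are not how this repo submits its job.

```python
def build_spark_step(script_uri: str, name: str = "spark-etl") -> dict:
    """Describe an EMR step that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        # Keep the cluster running if the step fails (an assumed choice).
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_step(cluster_id: str, script_uri: str) -> str:
    """Add the step to an existing cluster and return the new step ID."""
    # Imported lazily so build_spark_step can be used without boto3 installed.
    import boto3

    # The region is picked up from AWS_DEFAULT_REGION.
    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_spark_step(script_uri)],
    )
    return response["StepIds"][0]
```

Here `cluster_id` is the ID of a cluster already created in the console, and `script_uri` points at the PySpark script uploaded to S3.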

# References

- https://www.youtube.com/watch?v=gOT7El8rMws
- https://www.youtube.com/watch?v=r-ig8zpP3EM
