https://github.com/mikeacosta/data-lake-spark
Data lake ETL pipeline in Apache Spark
- Host: GitHub
- URL: https://github.com/mikeacosta/data-lake-spark
- Owner: mikeacosta
- Created: 2020-01-27T08:14:52.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-01-27T08:24:42.000Z (over 5 years ago)
- Last Synced: 2025-02-22T09:45:49.109Z (8 months ago)
- Topics: apache-spark, aws, emr, s3
- Language: Python
- Homepage:
- Size: 109 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Lake with Apache Spark
Data lake and ETL pipeline in Apache Spark that loads data from Amazon S3, processes it into analytics tables, and writes those tables back to S3.
## Background
The analytics team for music streaming startup Sparkify wants to analyze the song-listening activity of their users. This analysis will be based on mobile app JSON user activity logs and song metadata stored in Amazon S3.
## Datasets
The datasets are in a single Amazon S3 bucket, with the song and log datasets in the `song_data` and `log_data` folders, respectively.
### Song data
Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
Below is an example of the content in a single song file.
```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
### Log data
Log files are partitioned by year and month.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
Each log file contains JSON records of user activity events from the app.
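A minimal sketch of how these two inputs could be read with PySpark (the bucket name, app name, and wildcard paths below are illustrative assumptions, not necessarily what `etl.py` does):
```
from pyspark.sql import SparkSession

# Illustrative only: replace the bucket name with the actual source bucket.
spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Wildcards mirror the partitioning described above:
# song files sit three directory levels deep; log files sit under <year>/<month>/.
song_df = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://input-bucket/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```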
## Analytics tables
The ETL pipeline will create a star schema of analytics tables. Star schema models allow for simple queries and performance optimization for read-only analysis.
### Fact table
1. **songplays** - records in log data associated with song plays, i.e. records with page `NextSong`
- *songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*
### Dimension tables
2. **users**
- *user_id, first_name, last_name, gender, level*
3. **songs**
- *song_id, title, artist_id, year, duration*
4. **artists**
- *artist_id, name, location, latitude, longitude*
5. **time**
- *start_time, hour, day, week, month, year, weekday*
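Continuing the reading sketch above, here is one hedged sketch of deriving the **songs** dimension and the **songplays** fact table. The log column names, the title/artist/duration join condition, and the surrogate-key choice are assumptions for illustration; see `etl.py` for the actual transformations.
```
from pyspark.sql import functions as F

# Songs dimension: one row per song_id, taken straight from the song metadata.
songs_table = (song_df
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates(["song_id"]))

# Songplays fact: keep NextSong events and join them to songs to recover song_id/artist_id.
plays = log_df.filter(F.col("page") == "NextSong")
songplays_table = (
    plays.join(song_df,
               (plays.song == song_df.title) &
               (plays.artist == song_df.artist_name) &
               (plays.length == song_df.duration),
               "left")
         .select(F.monotonically_increasing_id().alias("songplay_id"),
                 (F.col("ts") / 1000).cast("timestamp").alias("start_time"),  # ts assumed to be epoch millis
                 F.col("userId").alias("user_id"),
                 "level", "song_id", "artist_id",
                 F.col("sessionId").alias("session_id"),
                 "location",
                 F.col("userAgent").alias("user_agent")))
```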
## Project files
- `etl.py` - Python script for running the ETL pipeline
- `README.md` - a summary of the project, how to run the Python scripts, and an explanation of the files in the repository
- `dl.cfg` - config file for AWS credentials
## Steps to run project
In `etl.py`, modify variable `output_data` to refer to your destination S3 bucket/folder.
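For example (continuing the sketches above; the bucket name is a placeholder and the partitioning columns are an assumption, not necessarily what `etl.py` uses):
```
# Placeholder destination bucket/folder; point this at a bucket you own.
output_data = "s3a://your-output-bucket/analytics/"

# Illustrative write: store the songs table as Parquet, partitioned by year and artist_id.
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs/")
```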
### Local environment
1. Add AWS access key and secret key in `dl.cfg` (see the config-loading sketch after these steps)
- An account with read and write access on S3 is required
```
[AWS]
AWS_ACCESS_KEY_ID=YOURACCESSKEYGOESHERE
AWS_SECRET_ACCESS_KEY=PUTyourSECRETaccessKEYhereTHISisREQUIRED
```
2. Open a command prompt
3. Execute the ETL pipeline
```
python etl.py
```
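One way `etl.py` could pick up the credentials from `dl.cfg` is via `configparser` and environment variables; this is an assumed sketch, so check the script itself for the actual mechanism.
```
import configparser
import os

# Read the [AWS] section of dl.cfg and expose the keys to the S3 filesystem libraries
# through environment variables (assumed approach).
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```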
### Amazon EMR
1. Create an EMR cluster using the Cluster launch mode, configured for Spark applications
- For security, choose an [EC2 key pair .pem file](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
2. Upload `etl.py` to an S3 bucket
3. [SSH into the EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html)
4. Download `etl.py` from S3
```
aws s3 cp s3://my-s3-bucket/etl.py .
```
5. Run Spark job
```
spark-submit etl.py
```