https://github.com/akarce/udacity-data-pipeline-with-airflow
Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.
- Host: GitHub
- URL: https://github.com/akarce/udacity-data-pipeline-with-airflow
- Owner: akarce
- Created: 2024-07-14T19:46:50.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-08-13T14:20:38.000Z (11 months ago)
- Last Synced: 2025-02-02T01:31:48.474Z (5 months ago)
- Topics: airflow, minio, postgresql, pyspark, spark
- Language: Python
- Homepage: https://medium.com/@akarce/udacity-data-pipeline-with-airflow-60f04f99efed
- Size: 84.6 MB
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Udacity-Data-Pipeline-with-Airflow
Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.

## Project Structure:
+ dags: Directory containing Airflow DAG scripts.
+ create_tables_and_buckets.py: DAG for creating tables on PostgreSQL and buckets on MinIO.
+ create_tables.sql: SQL script containing CREATE TABLE queries.
+ pipeline_dag.py: Main DAG script for the ETL data pipeline.
+ data: Directory for storing project source data.
+ log_data: Subdirectory for log data.
+ song_data: Subdirectory for song data.
+ plugins: Directory for custom Airflow plugins.
+ operators: Subdirectory for operator scripts.
+ __init__.py: Initialization script for operators.
+ stage_postgresql_operator.py: Operator script for copying data from MinIO to PostgreSQL.
+ load_fact_operator.py: Operator script for executing INSERT queries into the fact table.
+ load_dimension_operator.py: Operator script for executing INSERT queries into dimension tables.
+ data_quality_operator.py: Operator script for performing data quality checks.
+ helpers: Subdirectory for helper scripts.
+ __init__.py: Initialization script for helpers.
+ sql_queries.py: Script containing SQL queries for building dimensional tables.

### Airflow Dag Overview
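A hedged sketch of the task ordering implied by the operator list above; the task ids and the `EmptyOperator` stand-ins are illustrative assumptions, not taken from `pipeline_dag.py`:

```python
# Illustrative sketch only: EmptyOperator stands in for the custom operators
# listed above, and the task ids are assumptions, not the repository's own.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="pipeline_shape_example",
    start_date=datetime(2024, 7, 1),
    schedule=None,
    catchup=False,
) as dag:
    stage_events = EmptyOperator(task_id="stage_events_to_postgres")
    stage_songs = EmptyOperator(task_id="stage_songs_to_postgres")
    load_songplays = EmptyOperator(task_id="load_songplays_fact_table")
    load_users = EmptyOperator(task_id="load_users_dim_table")
    load_songs = EmptyOperator(task_id="load_songs_dim_table")
    load_artists = EmptyOperator(task_id="load_artists_dim_table")
    load_times = EmptyOperator(task_id="load_times_dim_table")
    quality_checks = EmptyOperator(task_id="run_data_quality_checks")

    # Stage both sources, build the fact table, fan out to the dimensions,
    # then finish with the data quality checks.
    [stage_events, stage_songs] >> load_songplays
    load_songplays >> [load_users, load_songs, load_artists, load_times] >> quality_checks
```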

### Staging & Fact Table Schema

### Dimension Tables Schema

# How to Run
## Clone the project repository
`$ git clone https://github.com/akarce/Udacity-Data-Pipeline-with-Airflow`
### Change directory
`$ cd Udacity-Data-Pipeline-with-Airflow`
## Build the docker images and start airflow:
`$ docker-compose up airflow-init `
`$ docker compose up -d`

### Create a database called sparkifydb:
`$ docker exec -it postgresdev psql -U postgres_user -d postgres_db`
`$ CREATE DATABASE sparkifydb; `
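Alternatively, assuming the same container name and credentials, the database can be created non-interactively in one step:
`$ docker exec postgresdev psql -U postgres_user -d postgres_db -c "CREATE DATABASE sparkifydb;"`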

### WebUI links:
`Airflow webserver` http://localhost:8080/ \
`MinIO` http://localhost:9001/
### Sign in to Airflow webserver:
`Username:` airflow
`Password:` airflow
### Create Postgres connection:
#### Go to admin -> Connections -> Add a new record
Connection Id: postgres_conn \
Connection Type: Postgres \
Host: 172.18.0.1 \
Database: sparkifydb \
Login: postgres_user \
Password: postgres_password \
Port: 5432
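The same connection can also be registered from the CLI inside an Airflow container; the `airflow-webserver` service name below is an assumption based on the standard Airflow docker-compose file:
`$ docker compose exec airflow-webserver airflow connections add 'postgres_conn' --conn-type 'postgres' --conn-host '172.18.0.1' --conn-schema 'sparkifydb' --conn-login 'postgres_user' --conn-password 'postgres_password' --conn-port '5432'`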
### Obtain Access Key from MinIO:
Go to MinIO WebUI using http://localhost:9001/ \
From Access Keys section -> create access key -> Create \
Store your Access Key and Secret Key using the Download for Import option
### Create MinIO connection:
#### Go to admin -> Connections -> Add a new record
Connection Id: minio_conn \
Connection Type: Amazon Web Services \
AWS Access Key ID: `<access key from MinIO>` \
AWS Secret Access Key: `<secret key from MinIO>`
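If the custom operators resolve the S3 endpoint through the connection, the Extra field may also need the MinIO endpoint. A hedged CLI example, where the `endpoint_url` key, the `minio:9000` address, and the `airflow-webserver` service name are assumptions about this compose setup rather than details from the repository:
`$ docker compose exec airflow-webserver airflow connections add 'minio_conn' --conn-type 'aws' --conn-login '<access key from MinIO>' --conn-password '<secret key from MinIO>' --conn-extra '{"endpoint_url": "http://minio:9000"}'`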
### Trigger the create_tables_and_buckets_dag using Airflow WebUI
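The trigger can also be issued from the CLI; the DAG id below is assumed to match the heading above, so check the Airflow UI if it differs:
`$ docker compose exec airflow-webserver airflow dags trigger create_tables_and_buckets_dag`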

This will create two buckets named `udacity-dend` and `processed`, \
and 7 Postgres tables named \
`artists`, `songplays`, `songs`, `staging_events`, `staging_songs`, `times`, `users`.

### To see the schemas of the tables use
`$ \d `
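For example, from outside the container, assuming the container name and credentials used earlier:
`$ docker exec postgresdev psql -U postgres_user -d sparkifydb -c "\d"`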
### Upload the log_data and song_data folders from the data folder into the udacity-dend bucket using the MinIO WebUI
`Username:` minioadmin \
`Password:` minioadmin
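If you prefer the command line over the WebUI, the MinIO client can upload the folders instead; this assumes `mc` is installed on the host and the MinIO API is published on port 9000 with the default credentials above:
`$ mc alias set local http://localhost:9000 minioadmin minioadmin`
`$ mc cp --recursive data/log_data data/song_data local/udacity-dend/`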
### Finally, unpause the DAG named `pipeline`:
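The DAG can also be unpaused and triggered from the CLI inside an Airflow container (the `airflow-webserver` service name is again an assumption):
`$ docker compose exec airflow-webserver airflow dags unpause pipeline`
`$ docker compose exec airflow-webserver airflow dags trigger pipeline`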

#### After the run finishes successfully, you can check the logs for data quality messages, and all objects will be moved to the processed bucket.

#### You can see the Airflow logs for data quality results:
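For reference, a minimal sketch of the kind of check such an operator can run, in the spirit of `data_quality_operator.py`; the class name, constructor arguments, and the simple row-count rule are illustrative assumptions rather than the repository's implementation:

```python
# Illustrative sketch of a row-count data quality check; not the repository's
# actual operator, whose arguments and rules may differ.
from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class RowCountQualityCheck(BaseOperator):
    """Fail the task if any of the given tables is empty."""

    def __init__(self, postgres_conn_id: str, tables: list[str], **kwargs):
        super().__init__(**kwargs)
        self.postgres_conn_id = postgres_conn_id
        self.tables = tables

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} has no rows")
            self.log.info("Data quality check passed: %s has %d rows", table, records[0][0])
```

Used as a task, a check like this would take `postgres_conn_id="postgres_conn"` and the seven table names created earlier.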
