ETL Workflow Automation with Apache Airflow
- Host: GitHub
- URL: https://github.com/mikeacosta/airflow-data-pipeline
- Owner: mikeacosta
- Created: 2020-02-19T20:15:35.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-20T15:39:53.000Z (over 5 years ago)
- Last Synced: 2025-01-10T19:28:50.979Z (4 months ago)
- Topics: airflow, etl, redshift
- Language: Python
- Homepage:
- Size: 131 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Pipelines with Apache Airflow
Automated data pipeline workflows using [Apache Airflow](https://airflow.apache.org/) that load and process data from Amazon S3 into an [Amazon Redshift](https://aws.amazon.com/redshift/) cloud data warehouse for analytics processing.
## Background
The analytics team for music streaming startup Sparkify wants to automate and better monitor their data warehouse ETL pipelines using the Apache Airflow open-source workflow management platform.
## Objective
The goal of this project is to author a data pipeline workflow in Airflow, built from custom operators that perform tasks such as staging data, populating the data warehouse, and running quality checks. The end result is a pipeline definition as illustrated below.
The pipeline transforms the data into a set of fact and dimension tables in Redshift.
## Custom operators
The project includes reusable operators that implement the functional pieces of the data pipeline.
### Stage operator
Loads JSON formatted files from S3 into Redshift by running a SQL COPY statement based on parameter values for the S3 bucket and target Redshift table.
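A minimal sketch of what such a stage operator could look like, assuming Airflow 1.10-style imports (matching the project's 2020 vintage); the class name, constructor parameters, and credential handling are illustrative assumptions, not the repository's exact implementation:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """Copy JSON files from an S3 location into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM 's3://{bucket}/{key}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_path}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", aws_key="", aws_secret="",
                 s3_bucket="", s3_key="", table="", json_path="auto",
                 *args, **kwargs):
        super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_key = aws_key
        self.aws_secret = aws_secret
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.table = table
        self.json_path = json_path

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        # Clear any previous load, then COPY the S3 data into the staging table.
        redshift.run("DELETE FROM {}".format(self.table))
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            bucket=self.s3_bucket,
            key=self.s3_key,
            access_key=self.aws_key,
            secret_key=self.aws_secret,
            json_path=self.json_path,
        ))
```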
### Fact operator
Loads output data from the stage operator into a fact table. SQL statement and target table are passed as parameters.
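A hedged sketch of the idea, with the same caveats as above (names and parameters are placeholders):

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadFactOperator(BaseOperator):
    """Insert the results of a SELECT statement into a fact table."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", table="", select_sql="",
                 *args, **kwargs):
        super(LoadFactOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        # Fact tables are typically append-only, so only INSERT the new rows.
        redshift.run("INSERT INTO {} {}".format(self.table, self.select_sql))
```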
### Dimension operator
Loads data from the fact table into dimension tables. SQL statement and target table are passed as parameters.
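A sketch along the same lines; the truncate-insert option shown here is a common pattern for dimension loads and is an assumption, not necessarily what the repository does:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadDimensionOperator(BaseOperator):
    """Load a dimension table from a SELECT statement, optionally truncating first."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", table="", select_sql="",
                 truncate=True, *args, **kwargs):
        super(LoadDimensionOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql
        self.truncate = truncate

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if self.truncate:
            # Dimension tables are often fully refreshed on each run.
            redshift.run("TRUNCATE TABLE {}".format(self.table))
        redshift.run("INSERT INTO {} {}".format(self.table, self.select_sql))
```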
### Data quality operator
Runs quality checks on the transformed data.
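One simple check is to fail the task when a loaded table is empty. The sketch below assumes a row-count check and a `tables` parameter; the repository's actual checks may differ:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fail the task if any monitored table contains no rows."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(
                    "Data quality check failed: {} returned no rows".format(table))
            self.log.info("Data quality check passed: %s has %s rows",
                          table, records[0][0])
```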
## Project files
In addition to files for implementing the above operator functionality, the project also includes:
- `sparkify_workflow_dag.py` - DAG (Directed Acyclic Graph) definition script
- `redshift.ipynb` - [Jupyter Notebook](https://jupyter.org/) for creating the Redshift cluster
- `create_tables.sql` - SQL for creating Redshift tables
- `sql_queries.py` - script with SQL for operators to import data
- `dwh.cfg` - configuration values for AWS services

## Steps to run project
1. Execute the steps in the Jupyter Notebook to create the Redshift cluster
An AWS IAM user with the following policies (or equivalent permissions) is required:
- AmazonRedshiftFullAccess
- AmazonS3ReadOnlyAccess
- IAMFullAccess
- AmazonEC2FullAccess

The access key and secret key need to be added to the `[AWS]` section in the `dwh.cfg` file.
```
[AWS]
KEY=YOURACCESSKEYGOESHERE
SECRET=PUTyourSECRETaccessKEYhereTHISisREQUIRED
```

2. Run the queries in `create_tables.sql`
3. Start Airflow and toggle "ON" the DAG named `sparkify_workflow`
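For reference, a minimal sketch of how a DAG definition like `sparkify_workflow_dag.py` might wire such operators together. Task IDs, table names, S3 locations, schedule, and import paths are illustrative placeholders, not the repository's actual code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# The custom operators are assumed to be importable from an Airflow plugins
# package; this module path is a placeholder, not the repository's layout.
from operators import (
    StageToRedshiftOperator,
    LoadFactOperator,
    LoadDimensionOperator,
    DataQualityOperator,
)

default_args = {
    "owner": "sparkify",
    "start_date": datetime(2020, 1, 1),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "sparkify_workflow",
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
)

start = DummyOperator(task_id="begin_execution", dag=dag)
end = DummyOperator(task_id="stop_execution", dag=dag)

stage_events = StageToRedshiftOperator(
    task_id="stage_events", dag=dag,
    table="staging_events",
    s3_bucket="your-s3-bucket",  # placeholder bucket and prefix
    s3_key="log_data",
)

load_songplays = LoadFactOperator(
    task_id="load_songplays_fact_table", dag=dag,
    table="songplays",
    select_sql="SELECT ...",  # the real query would live in sql_queries.py
)

load_users = LoadDimensionOperator(
    task_id="load_user_dim_table", dag=dag,
    table="users",
    select_sql="SELECT ...",
)

quality_checks = DataQualityOperator(
    task_id="run_data_quality_checks", dag=dag,
    tables=["songplays", "users"],
)

# Staging feeds the fact table, dimensions follow, and quality checks run last.
start >> stage_events >> load_songplays
load_songplays >> load_users >> quality_checks >> end
```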