https://github.com/evanmathew/reddit_etl_de
This project demonstrates a complete data pipeline for extracting, transforming, and loading (ETL) Reddit data into an Amazon Redshift data warehouse. The pipeline uses various AWS services and tools including Apache Airflow, PostgreSQL, AWS S3, AWS Glue, AWS Athena, and Amazon Redshift. The project is orchestrated using Docker and Apache Airflow
- Host: GitHub
- URL: https://github.com/evanmathew/reddit_etl_de
- Owner: evanmathew
- Created: 2024-07-13T15:46:44.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-14T09:38:42.000Z (almost 2 years ago)
- Last Synced: 2025-06-01T14:01:16.409Z (about 1 year ago)
- Topics: airflow, aws-catalog-api-handler, aws-glue, aws-quicksight, aws-redshift, aws-s3, data-engineering, docker, etl, postgresql-database, python, reddit-api
- Language: Python
- Homepage:
- Size: 137 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
# Reddit Data Engineering Project
## Overview
This project demonstrates a complete extract, transform, load (ETL) pipeline that moves Reddit data into an Amazon Redshift data warehouse. It combines Apache Airflow and PostgreSQL with several AWS services: S3, AWS Glue, Amazon Athena, and Amazon Redshift. The services run in Docker containers and Apache Airflow orchestrates the workflow, which keeps deployment straightforward.
## Architecture
### Data Source
- **Reddit API**: The source of the data. Reddit data is extracted using the Reddit API.
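
As a rough illustration of what happens to extracted posts before they reach storage, the sketch below flattens post records into CSV text. It is not the project's code: the field list in `POST_FIELDS`, the helper names `flatten_post` and `posts_to_csv`, and the sample post are assumptions made for the example, and no call to the Reddit API is made.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical subset of post fields to keep; the project may keep others.
POST_FIELDS = ["id", "title", "author", "score", "num_comments", "created_utc"]


def flatten_post(post: dict) -> dict:
    """Keep only POST_FIELDS and convert the Unix timestamp to ISO 8601 (UTC)."""
    row = {field: post.get(field) for field in POST_FIELDS}
    if row["created_utc"] is not None:
        row["created_utc"] = datetime.fromtimestamp(
            row["created_utc"], tz=timezone.utc
        ).isoformat()
    return row


def posts_to_csv(posts: list) -> str:
    """Serialize posts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS, lineterminator="\n")
    writer.writeheader()
    for post in posts:
        writer.writerow(flatten_post(post))
    return buffer.getvalue()


# Made-up sample record standing in for an API response.
sample_posts = [
    {"id": "abc123", "title": "Hello", "author": "someone", "score": 10,
     "num_comments": 2, "created_utc": 86400, "extra": "dropped"},
]
csv_text = posts_to_csv(sample_posts)
print(csv_text)
```

Keeping the flattening step separate from the API call makes it easy to test without credentials or network access.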
### Data Processing and Orchestration
- **Apache Airflow**: Orchestrates the data pipeline. Airflow schedules the tasks and runs them in dependency order so that data moves through each stage.
- **PostgreSQL**: Used as the metadata database for Apache Airflow.
- **Celery**: Used for distributed task queueing to handle asynchronous tasks.
- **Docker**: Containers used for packaging and deploying the services.
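
Airflow models a pipeline as a directed acyclic graph (DAG) and runs a task only after its upstream tasks have finished. The sketch below illustrates that ordering idea with Python's standard library rather than the Airflow API. The task names are hypothetical and are not taken from the project's DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "extract_reddit": set(),
    "upload_to_s3": {"extract_reddit"},
    "run_glue_job": {"upload_to_s3"},
    "load_redshift": {"run_glue_job"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

In an actual Airflow DAG the same dependencies would be declared between operators, and the scheduler would work out the order.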
### AWS Components
- **S3 Buckets**:
  - **Raw Storage**: Stores raw data from Reddit.
  - **Transformed Storage**: Stores transformed data ready for further processing and querying.
- **AWS Glue**:
  - **Data Catalog**: Maintains metadata of the datasets stored in S3.
  - **Crawlers**: Crawl data in S3 and populate the Data Catalog.
  - **ETL (Extract, Transform, Load)**: Transforms and loads data from S3 to Redshift.
- **Amazon Athena**: Used for querying data stored in S3 using SQL.
- **Amazon Redshift**: A data warehouse where the final transformed data is stored for analysis.
- **AWS IAM**: Manages access and permissions for AWS services.
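
One way to keep raw and transformed data apart is a consistent naming scheme for S3 objects. The sketch below builds object URIs under separate prefixes. The bucket name, the `raw` and `transformed` prefixes, the file naming pattern, and the helper `s3_uri` are all assumptions for illustration. The project may use separate buckets or a different layout.

```python
from datetime import date

# Hypothetical prefixes; the project may use separate buckets instead.
RAW_PREFIX = "raw"
TRANSFORMED_PREFIX = "transformed"


def s3_uri(bucket: str, prefix: str, run_date: date, name: str = "reddit") -> str:
    """Build an S3 URI such as s3://<bucket>/<prefix>/<name>_<YYYYMMDD>.csv."""
    return f"s3://{bucket}/{prefix}/{name}_{run_date:%Y%m%d}.csv"


# "example-bucket" is a placeholder, not the project's bucket.
raw_uri = s3_uri("example-bucket", RAW_PREFIX, date(2024, 7, 13))
transformed_uri = s3_uri("example-bucket", TRANSFORMED_PREFIX, date(2024, 7, 13))
print(raw_uri)
print(transformed_uri)
```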
### BI and Analytics Tools
- **Power BI**
- **Amazon QuickSight**
- **Tableau**
- **Looker Studio**
## Data Flow
1. **Extraction**: Reddit data is extracted using the Reddit API and saved to the raw storage S3 bucket.
2. **Transformation**: Data is processed using AWS Glue, transforming it into a structured format.
3. **Loading**: The transformed data is loaded into Amazon Redshift for analysis.
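
The README states that AWS Glue performs the load into Redshift. As a point of comparison only, another common way to load CSV files from S3 is Redshift's `COPY` command. The sketch below builds such a statement as a string. The table name, S3 URI, IAM role ARN, and the helper `build_copy_statement` are placeholders, and nothing is executed against a database.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement for a CSV file with a header row.

    Values are interpolated directly for illustration; do not pass
    untrusted input to this function.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# All three arguments are placeholders, not values from the project.
statement = build_copy_statement(
    "reddit_posts",
    "s3://example-bucket/transformed/reddit_20240713.csv",
    "arn:aws:iam::123456789012:role/example-redshift-role",
)
print(statement)
```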
## Architecture Diagram
