Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mananapr/reddit_india_pipeline
Pipeline that scrapes data from r/india subreddit and finalizes data for the visual layer
- Host: GitHub
- URL: https://github.com/mananapr/reddit_india_pipeline
- Owner: mananapr
- License: mit
- Created: 2024-02-01T12:38:34.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-02-05T15:51:33.000Z (11 months ago)
- Last Synced: 2024-10-15T18:26:52.349Z (3 months ago)
- Topics: airflow, aws, data-engineering, etl, etl-pipeline, metabase, terraform, web-scraping
- Language: Python
- Homepage:
- Size: 160 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Reddit India Pipeline
Pipeline that scrapes data from the [r/india](https://old.reddit.com/r/india) subreddit and prepares it for the visual layer.

## Architecture
![flowchart](flowchart.png)

- **Infra Provisioning:** Terraform (with AWS)
- **Containerization:** Docker
- **Orchestration:** Airflow
- **Visual Layer:** Metabase
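Airflow ties these pieces together. Below is a minimal sketch of how such a DAG could be wired, assuming Airflow 2.x and hypothetical task ids and callables; the repository's actual DAG code may look different.

```python
# Minimal sketch of an Airflow DAG wiring the pipeline stages together.
# Task ids and callables are illustrative only, not the repository's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_bronze():
    """Scrape r/india and write the raw (bronze) records."""
    ...


def load_bronze_to_s3():
    """Validate the bronze records and upload them to S3."""
    ...


def build_silver():
    """Derive the cleaned (silver) dataset and upload it to S3."""
    ...


def load_redshift():
    """Load the silver data from S3 into Redshift."""
    ...


with DAG(
    dag_id="reddit_india_pipeline",
    start_date=datetime(2024, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_bronze", python_callable=scrape_bronze)
    bronze = PythonOperator(task_id="load_bronze_to_s3", python_callable=load_bronze_to_s3)
    silver = PythonOperator(task_id="build_silver", python_callable=build_silver)
    redshift = PythonOperator(task_id="load_redshift", python_callable=load_redshift)

    scrape >> bronze >> silver >> redshift
```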
### DAG Tasks:

1. Scrape data from r/india to generate bronze data
2. Validate using Pydantic and load data to S3 (see the sketch after this list)
3. Generate and validate silver data and load to S3
4. Load silver data into Redshift
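A minimal sketch of what the Pydantic validation in step 2 might look like, using hypothetical field names (`post_id`, `score`, etc.) rather than the pipeline's actual schema:

```python
# Illustrative Pydantic model for a scraped r/india post; the actual
# schema used by the pipeline may differ.
from datetime import datetime

from pydantic import BaseModel, ValidationError


class RedditPost(BaseModel):
    post_id: str
    title: str
    author: str
    score: int
    num_comments: int
    created_utc: datetime
    permalink: str


def validate_records(raw_records: list[dict]) -> list[RedditPost]:
    """Keep only records that satisfy the schema; skip the rest."""
    valid = []
    for record in raw_records:
        try:
            valid.append(RedditPost(**record))
        except ValidationError:
            continue  # drop malformed records before loading to S3
    return valid
```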
## Requirements

1. AWS CLI and Terraform for infra provisioning
2. Docker for Airflow and DAG execution

## Setup
Setup and initial execution are handled by the Makefile.

1. `make init`: Initializes Airflow (user setup, DB migrations)
2. `make infra`: Sets up the AWS infrastructure (S3, Redshift, Budget) and creates the `configuration.env` file with the secrets (see the sketch after this list)
3. `make up`: Runs Airflow
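The secrets in `configuration.env` can then be picked up by the pipeline code. A purely hypothetical sketch of reading that file with `python-dotenv`; the actual key names and the way the file is consumed (for example via Docker env files) may differ:

```python
# Hypothetical: read the secrets that `make infra` writes to configuration.env.
# The variable names below are placeholders, not the file's actual keys.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("configuration.env")

s3_bucket = os.environ["S3_BUCKET"]          # placeholder key
redshift_host = os.environ["REDSHIFT_HOST"]  # placeholder key
```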
## Dashboard

![dashboard](dashboard.jpg)