https://github.com/to-infinitee/data-infra-architecture
An infrastructure diagram for data pipelines in which data is ingested, stored, transformed, and then visualized for analysis. It uses Airbyte, dbt, and Airflow, a modern data stack focused on efficiency and scalability.
- Host: GitHub
- URL: https://github.com/to-infinitee/data-infra-architecture
- Owner: to-infinitee
- Created: 2024-08-21T23:53:55.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-22T00:08:37.000Z (almost 2 years ago)
- Last Synced: 2025-06-01T22:57:10.351Z (about 1 year ago)
- Topics: airbyte, airflow, dbt, ec2, github, iac, python, snowflake, vpc
- Homepage:
- Size: 15.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README

This repository outlines the architecture and implementation of a data pipeline that handles the generation, storage, ingestion, transformation, and visualization of data. The infrastructure is built on a modern data stack that combines cloud services and open-source tools for scalability, security, and efficiency.
## Components Breakdown
### 1. **Data Generation**
- **EC2 Instance (Python Script)**
  - **Description:** An Amazon EC2 instance runs a Python script that generates synthetic data every 2 hours.
  - **Purpose:** Simulates ongoing data generation for testing and development.
  - **Location:** Public subnet within a Virtual Private Cloud (VPC).
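
The generation step could look roughly like the following sketch. The README does not describe the shape of the data, so the order-like fields (`order_id`, `customer_id`, `amount`, `created_at`) and the function names are assumptions for illustration only; the 2-hour interval is the one stated above.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Cadence stated in the README: one batch every 2 hours.
INTERVAL = timedelta(hours=2)


def generate_batch(batch_time, size, seed=None):
    """Return `size` synthetic rows stamped with `batch_time`.

    The row schema is hypothetical; the repository does not document one.
    """
    rng = random.Random(seed)  # a seed makes a batch reproducible in tests
    return [
        {
            "order_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "customer_id": rng.randint(1, 1000),
            "amount": round(rng.uniform(5.0, 500.0), 2),
            "created_at": batch_time.isoformat(),
        }
        for _ in range(size)
    ]


def next_run(last_run):
    """Return the time at which the next batch is due."""
    return last_run + INTERVAL


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for row in generate_batch(now, size=3):
        print(row)
    print("next batch due at", next_run(now).isoformat())
```

On the EC2 instance, a cron entry or a simple loop could call `generate_batch` once per interval and write the rows to the database.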
### 2. **Data Storage**
- **Amazon RDS Postgres**
  - **Description:** A Postgres database hosted on Amazon RDS stores the data generated by the EC2 instance.
  - **Purpose:** Acts as the primary storage for generated data.
  - **Location:** Private subnet within the VPC for enhanced security.
### 3. **Data Ingestion**
- **Airbyte**
  - **Description:** Airbyte ingests new data from the RDS Postgres database daily.
  - **Purpose:** Extracts data from the operational database and loads it into the data warehouse.
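
The README does not say which Airbyte sync mode is configured. If an incremental mode is used, "ingesting new data" amounts to cursor-based extraction: only rows newer than the last saved cursor value are read, and the cursor then advances. The sketch below illustrates that idea in plain Python. It is a conceptual model, not Airbyte's API, and the `created_at` cursor column is a hypothetical name.

```python
from datetime import datetime


def extract_new_rows(rows, cursor):
    """Return rows strictly newer than `cursor`, plus the advanced cursor.

    `rows` is any iterable of dicts with a `created_at` datetime
    (hypothetical column name). If nothing is new, the cursor is unchanged.
    """
    new_rows = [r for r in rows if r["created_at"] > cursor]
    new_cursor = max((r["created_at"] for r in new_rows), default=cursor)
    return new_rows, new_cursor


if __name__ == "__main__":
    source = [
        {"id": 1, "created_at": datetime(2024, 8, 21, 22)},
        {"id": 2, "created_at": datetime(2024, 8, 22, 0)},
        {"id": 3, "created_at": datetime(2024, 8, 22, 2)},
    ]
    # First daily sync: pick up everything after the stored cursor.
    batch, cursor = extract_new_rows(source, datetime(2024, 8, 21, 23))
    print([r["id"] for r in batch], cursor)
    # A second sync with no new source rows extracts nothing.
    batch, cursor = extract_new_rows(source, cursor)
    print([r["id"] for r in batch], cursor)
```

Because the comparison is strict, re-running a sync with an unchanged source extracts no duplicates.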
### 4. **Data Warehouse**
- **Snowflake**
  - **Description:** Snowflake is a cloud-based data warehouse where ingested data is stored.
  - **Purpose:** Provides scalable and efficient storage for large datasets, allowing for fast query performance.
### 5. **Data Transformation**
- **dbt (Data Build Tool)**
  - **Description:** dbt transforms raw data within Snowflake into analysis-ready tables.
  - **Purpose:** Prepares data for analysis by cleaning, aggregating, and organizing it.
- **Apache Airflow**
  - **Description:** Airflow schedules and orchestrates dbt jobs so that data is refreshed daily.
  - **Purpose:** Automates the transformation process to keep the data in the warehouse up to date.
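
The README does not show the Airflow DAG itself. As a minimal sketch, the daily refresh can be thought of as an ordered list of dbt CLI invocations that an orchestrator task executes. `dbt run` and `dbt test` with `--project-dir` and `--target` are standard dbt CLI usage; the project path and target name below are hypothetical, and running tests after the models is an assumption rather than something the README states.

```python
import shlex


def dbt_commands(project_dir, target):
    """Return the shell commands for one refresh: build models, then test them.

    The two-step order (run, then test) is an assumption for illustration.
    """
    shared_flags = ["--project-dir", project_dir, "--target", target]
    return [shlex.join(["dbt", step, *shared_flags]) for step in ("run", "test")]


if __name__ == "__main__":
    # Hypothetical project location and target name.
    for command in dbt_commands("/opt/dbt/analytics", "prod"):
        print(command)
```

In Airflow, each string could serve as the `bash_command` of a `BashOperator` in a DAG that is scheduled daily.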
### 6. **Data Visualization**
- **Looker**
  - **Description:** Looker connects to Snowflake to provide interactive data visualizations and dashboards.
  - **Purpose:** Enables stakeholders to explore and analyze data through an intuitive interface.
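
With Looker as the final consumer, the data flow through sections 1 to 6 can be summarized as a dependency graph. The sketch below encodes each component with the component it reads from and derives the execution order using Python's standard `graphlib` module (Python 3.9+). The node names are hypothetical labels chosen for this example, not identifiers from the repository.

```python
from graphlib import TopologicalSorter

# Each key depends on the components in its set (its upstream sources).
PIPELINE = {
    "rds_postgres": {"ec2_generator"},
    "airbyte": {"rds_postgres"},
    "snowflake_raw": {"airbyte"},
    "dbt_models": {"snowflake_raw"},
    "looker": {"dbt_models"},
}


def run_order(graph):
    """Return the components in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(run_order(PIPELINE)))
```

Writing the flow down this way makes the ordering explicit, and `TopologicalSorter` raises `graphlib.CycleError` if a circular dependency is introduced by mistake.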
### 7. **CI/CD, Version Control, and Infrastructure as Code**
- **GitHub & GitHub Actions**
  - **Description:** GitHub provides version control, and GitHub Actions automates CI/CD processes.
  - **Purpose:** Manages code and infrastructure changes, ensuring consistency and reliability.
- **Docker**
  - **Description:** Docker containerizes applications so that they run in consistent environments.
  - **Purpose:** Facilitates deployment across different environments.
- **Terraform**
  - **Description:** Terraform provisions cloud resources as Infrastructure as Code (IaC).
  - **Purpose:** Automates the setup and management of infrastructure in a repeatable manner.
## Deployment
### Prerequisites
- AWS account with necessary permissions for EC2, RDS, and VPC.
- GitHub account with repositories for version control.
- Docker installed on your local machine.
- Terraform installed on your local machine.
- Access to a Snowflake account.
- Airbyte instance set up for data ingestion.
- Looker account for data visualization.
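
Two of the prerequisites above are local command-line tools. A small check like the following, which is not part of the repository, could confirm that they are on the `PATH` before deploying. It assumes the executables are named `docker` and `terraform`, which are their default names.

```python
import shutil

# Local CLI tools listed in the prerequisites.
REQUIRED_TOOLS = ("docker", "terraform")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All local prerequisites found.")
```

The account-based prerequisites (AWS, GitHub, Snowflake, Airbyte, Looker) cannot be verified this way and still need to be checked manually.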