An open API service indexing awesome lists of open source software.

https://github.com/camilajaviera91/gcp-new

This project defines a modern data pipeline architecture using Airflow, DBT, and PostgreSQL. Below you'll find instructions on how to get started and how the repository is structured.
https://github.com/camilajaviera91/gcp-new

airflow airflow-dags bashoperator docker-compose dotenv os pandas pipelines psycopg2 pythonoperator

Last synced: about 1 month ago
JSON representation

This project defines a modern data pipeline architecture using Airflow, DBT, and PostgreSQL. Below you'll find instructions on how to get started and how the repository is structured.

Awesome Lists containing this project

README

          

# πŸŒ€ Airflow + DBT + PostgreSQL Data Pipeline

This repository implements a modern, modular data pipeline using:

- Apache Airflow for orchestration

- DBT for SQL-based transformations

- PostgreSQL as both the source/target database and metadata store

> πŸ’‘ Ideal for learning, development, and lightweight data integration projects.

---

## πŸ“‚ Project Structure

```
.
β”œβ”€β”€ 1_init.sh
β”œβ”€β”€ 2_reset_docker.sh
β”œβ”€β”€ 3_fix_permissions.sh
β”œβ”€β”€ credentials #gitignore
β”‚ └── auth.json
β”œβ”€β”€ dags
β”‚ └── dag.py
β”œβ”€β”€ dbt_project
β”‚ β”œβ”€β”€ dbt_project.yml
β”‚ β”œβ”€β”€ models
β”‚ β”‚ β”œβ”€β”€ marts
β”‚ β”‚ β”‚ β”œβ”€β”€ final_report.sql
β”‚ β”‚ β”‚ └── sales_by_product.sql
β”‚ β”‚ β”œβ”€β”€ schema.sql
β”‚ β”‚ └── staging
β”‚ β”‚ β”œβ”€β”€ clients.sql
β”‚ β”‚ β”œβ”€β”€ orders.sql
β”‚ β”‚ └── products.sql
β”‚ └── profiles.yml
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ Dockerfile.airflow
β”œβ”€β”€ files
β”‚ β”œβ”€β”€ clients.csv
β”‚ β”œβ”€β”€ final_report.csv
β”‚ β”œβ”€β”€ orders.csv
β”‚ β”œβ”€β”€ products.csv
β”‚ └── sales_by_product.csv
β”œβ”€β”€ LICENSE
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
└── scripts
β”œβ”€β”€ extract
β”‚ └── extract.py
β”œβ”€β”€ load
β”‚ └── load_data.py
└── utils
└── utils.py
```

---

## πŸš€ Getting Started

Before running the pipeline, make sure to create the following folders in the root directory of the project:

```
.
β”œβ”€β”€ dags/ # Airflow DAG definitions
β”œβ”€β”€ dbt_project/ # DBT transformations and config
β”‚ └── models/
β”‚ β”œβ”€β”€ staging/ # Raw β†’ Staging transformations
β”‚ └── marts/ # Staging β†’ Marts (analytics-ready)
β”œβ”€β”€ files/ # CSVs, exports, mock datasets
β”œβ”€β”€ scripts/ # Python utilities for extract/load/validation
β”‚ β”œβ”€β”€ extract/
β”‚ β”œβ”€β”€ load/
β”‚ └── utils/
```

- `dags/`: Contains Airflow DAGs to orchestrate the pipeline.
- `dbt_project/`: Contains the DBT project with all SQL transformation models.
- `models/`
- `staging/`: Contains staging models for cleaning and preparing raw data.
- `marts/`: Contains data marts for final models ready for analysis and reporting.
- `files/`: Stores input/output files such as CSVs.
- `scripts/`: Includes helper scripts for data extraction, validation, and loading.

---

## πŸ› οΈ Prerequisites

Make sure you have the following installed:

- Python 3.10+
- Docker & Docker Compose
- DBT
- Apache Airflow (v2+)
- PostgreSQL

---

## πŸ“¦ Installation

```bash
# Clone the repository
git clone git@github.com:CamilaJaviera91/gcp-new.git
cd gcp-new

# Create required folders
mkdir -p dags dbt_project/models/{staging,marts} files scripts/{extract,load,utils}
```

---

## 🐳 Docker Setup with Airflow and PostgreSQL

This project uses Docker Compose to orchestrate the following services:

| Service | Description |
| --------------------- | ---------------------------------------------- |
| **PostgreSQL** | Stores raw/transformed data & Airflow metadata |
| **Airflow Webserver** | UI to manage DAGs |
| **Airflow Scheduler** | Triggers DAG tasks based on time or sensors |
| **Airflow Init** | Initializes metadata DB, creates user |

> Make sure the previous structure exists before launching the containers:

---

## βš™οΈ .env Configuration

Create a .env file with the following (sample):

```
# Airflow
AIRFLOW__CORE__EXECUTOR=...
AIRFLOW__CORE__LOAD_EXAMPLES=...
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...
AIRFLOW__WEBSERVER__SECRET_KEY=...

# PostgreSQL
POSTGRES_SCHEMA=...
POSTGRES_HOST=...
POSTGRES_PORT=...
POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...

# Bigquery
GOOGLE_CREDENTIALS_PATH=...
BQ_PROJECT_ID=....
BQ_DATASET=...

```

---

## πŸ“¦ Python Dependencies

This project uses a `requirements.txt` file to manage all Python dependencies needed for the data pipeline, including Airflow, DBT, PostgreSQL, testing, and development tools.

#### πŸ”§ What's Included

| Category | Package(s) | Purpose |
| -------- | ---------- | ------- |
| **DBT** | `dbt-core`, `dbt-postgres`, `dbt-bigquery` | DBT functionality for PostgreSQL and BigQuery |
| **Airflow** | `apache-airflow==2.9.1`, `apache-airflow-providers-openlineage` | Workflow orchestration |
| **Database** | `psycopg2-binary==2.9.9` | PostgreSQL connector used by Airflow and DBT |
| **Compatibility** | `protobuf<5`, `sqlparse<0.5` | Ensures compatibility with DBT and Airflow |
| **Environment Variables** | `python-dotenv==1.1.0` | Loads `.env` files for secure and flexible config |
| **Synthetic Data** | `faker==24.9.0` | Generate fake data for testing or mock pipelines |
| **Testing** | `pytest`, `pytest-mock` Unit testing and mocking for pipeline components |
| **Code Quality** | `black`, `flake8`, `isort` | Code formatting, linting, and import sorting |
| **Data Analysis** | `numpy`, `pandas`, `matplotlib` | Analyze, transform, and visualize data in Python |
| **GoogleSheets Integration**|`gspread`, `gspread-dataframe`, `oauth2client`| Interact with GoogleSheets via API |

### πŸ› οΈ Docker Compose Setup

Sample docker-compose.yml setup is included in the repo and features:

- PostgreSQL with persistent volume

- Airflow Webserver, Scheduler, Init

- Custom Dockerfile for Airflow + DBT + Python deps

> βœ… Make sure volumes: in each service are properly mapped to ./dags, ./scripts, etc.

### πŸŒ€ Dockerfile.airflow

This file sets up the Airflow environment with Python dependencies and your DBT project.

```
FROM apache/airflow:2.10.0-python3.11

USER root
RUN apt-get update && apt-get install -y build-essential git

USER airflow

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

---

## βš™οΈ Helper Scripts

To simplify setup and maintenance, the project includes the following Bash scripts:

| Script | Description |
| ---------------------- | ------------------------------------------ |
| `1_init.sh` | Initialize Airflow DB, create admin user |
| `2_reset_docker.sh` | Reset all containers, volumes, and rebuild |
| `3_fix_permissions.sh` | Fix volume permissions (Linux only) |

---

## 🌐 Access the Airflow Web UI

Once the containers are up and the initialization step has been completed, you can access the **Apache Airflow** web interface to monitor, manage, and trigger your DAGs.

### πŸ”— Open in your browser

[Localhost:8080](http://localhost:8080)

This URL points to the **Airflow webserver** running inside the Docker container and exposed on your local machine's port `8080`.

### πŸ” Default login credentials

If you used the initialization script (`./1_init.sh`), the following admin user was created automatically:

```
Username: admin
Password: admin
```

> πŸ’‘ You can customize these credentials by modifying the `airflow users create` command inside the `airflow-init` service or the `1_init.sh` script.

### πŸ–₯️ What you’ll see

After logging in, you’ll be able to:

- View all DAGs in the `dags/` folder
- Trigger DAGs manually or wait for scheduled runs
- Monitor task statuses and inspect logs
- Manage Airflow Connections, Variables, and Pools
- Access admin configurations and user management

### πŸ› οΈ Troubleshooting DAGs
If DAGs don't appear:

- Check that dags/*.py files define a DAG object

- Use: docker compose logs -f airflow-webserver for debug

---

## πŸ“ˆ What’s Next?
This pipeline is ready for:

- [X] πŸ’‘ Building DAGs with Python and Airflow
- [X] πŸ“€ Exporting data to CSV or Google Sheets
- [X] πŸ”— Connecting to BigQuery
- [ ] πŸ“Š Creating visualization
- [ ] 🧠 Modeling datasets with DBT and version control

---

## πŸ“¬ Feedback or Questions?
Feel free to open an issue or submit a PR!