https://github.com/camilajaviera91/gcp-new
This project defines a modern data pipeline architecture using Airflow, DBT, and PostgreSQL. Below you'll find instructions on how to get started and how the repository is structured.
- Host: GitHub
- URL: https://github.com/camilajaviera91/gcp-new
- Owner: CamilaJaviera91
- License: mit
- Created: 2025-07-20T21:42:06.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-29T21:46:50.000Z (10 months ago)
- Last Synced: 2025-07-29T23:03:09.024Z (10 months ago)
- Topics: airflow, airflow-dags, bashoperator, docker-compose, dotenv, os, pandas, pipelines, psycopg2, pythonoperator
- Language: Python
- Homepage:
- Size: 165 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Airflow + DBT + PostgreSQL Data Pipeline
This repository implements a modern, modular data pipeline using:
- Apache Airflow for orchestration
- DBT for SQL-based transformations
- PostgreSQL as both the source/target database and metadata store
> Ideal for learning, development, and lightweight data integration projects.
---
## Project Structure
```
.
├── 1_init.sh
├── 2_reset_docker.sh
├── 3_fix_permissions.sh
├── credentials #gitignore
│   └── auth.json
├── dags
│   └── dag.py
├── dbt_project
│   ├── dbt_project.yml
│   ├── models
│   │   ├── marts
│   │   │   ├── final_report.sql
│   │   │   └── sales_by_product.sql
│   │   ├── schema.sql
│   │   └── staging
│   │       ├── clients.sql
│   │       ├── orders.sql
│   │       └── products.sql
│   └── profiles.yml
├── docker-compose.yml
├── Dockerfile.airflow
├── files
│   ├── clients.csv
│   ├── final_report.csv
│   ├── orders.csv
│   ├── products.csv
│   └── sales_by_product.csv
├── LICENSE
├── README.md
├── requirements.txt
└── scripts
    ├── extract
    │   └── extract.py
    ├── load
    │   └── load_data.py
    └── utils
        └── utils.py
```
---
## Getting Started
Before running the pipeline, make sure to create the following folders in the root directory of the project:
```
.
├── dags/                # Airflow DAG definitions
├── dbt_project/         # DBT transformations and config
│   └── models/
│       ├── staging/     # Raw → Staging transformations
│       └── marts/       # Staging → Marts (analytics-ready)
├── files/               # CSVs, exports, mock datasets
└── scripts/             # Python utilities for extract/load/validation
    ├── extract/
    ├── load/
    └── utils/
```
- `dags/`: Contains Airflow DAGs to orchestrate the pipeline.
- `dbt_project/`: Contains the DBT project with all SQL transformation models.
  - `models/`
    - `staging/`: Contains staging models for cleaning and preparing raw data.
    - `marts/`: Contains data marts for final models ready for analysis and reporting.
- `files/`: Stores input/output files such as CSVs.
- `scripts/`: Includes helper scripts for data extraction, validation, and loading.
---
## Prerequisites
Make sure you have the following installed:
- Python 3.10+
- Docker & Docker Compose
- DBT
- Apache Airflow (v2+)
- PostgreSQL
---
## Installation
```bash
# Clone the repository
git clone git@github.com:CamilaJaviera91/gcp-new.git
cd gcp-new
# Create required folders
mkdir -p dags dbt_project/models/{staging,marts} files scripts/{extract,load,utils}
```
---
## Docker Setup with Airflow and PostgreSQL
This project uses Docker Compose to orchestrate the following services:
| Service | Description |
| --------------------- | ---------------------------------------------- |
| **PostgreSQL** | Stores raw/transformed data & Airflow metadata |
| **Airflow Webserver** | UI to manage DAGs |
| **Airflow Scheduler** | Triggers DAG tasks based on time or sensors |
| **Airflow Init** | Initializes metadata DB, creates user |
> Make sure the folder structure described in Getting Started exists before launching the containers.
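
One way to verify the layout before launching the containers is a small Python check. The sketch below is not part of the repository: the function name `missing_dirs` and the `REQUIRED_DIRS` list are illustrative, with the folder names taken from the Getting Started section.

```python
from pathlib import Path

# Folder layout taken from the "Getting Started" section of this README.
REQUIRED_DIRS = [
    "dags",
    "dbt_project/models/staging",
    "dbt_project/models/marts",
    "files",
    "scripts/extract",
    "scripts/load",
    "scripts/utils",
]


def missing_dirs(root: str = ".") -> list[str]:
    """Return the required folders that do not exist under `root`."""
    base = Path(root)
    return [d for d in REQUIRED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```

Run it from the project root; an empty result means every expected folder is present.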
---
## .env Configuration
Create a `.env` file with the following variables (sample):
```
# Airflow
AIRFLOW__CORE__EXECUTOR=...
AIRFLOW__CORE__LOAD_EXAMPLES=...
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...
AIRFLOW__WEBSERVER__SECRET_KEY=...
# PostgreSQL
POSTGRES_SCHEMA=...
POSTGRES_HOST=...
POSTGRES_PORT=...
POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...
# BigQuery
GOOGLE_CREDENTIALS_PATH=...
BQ_PROJECT_ID=...
BQ_DATASET=...
```
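The PostgreSQL variables above can be gathered into connection settings in one place, so that a missing value fails early rather than deep inside a task. This is a minimal sketch, not the repository's `utils.py`: the function name `postgres_settings` and the mapping `REQUIRED_VARS` are hypothetical, and only the variable names come from the sample `.env`.

```python
import os
from typing import Mapping

# Variable names come from the sample .env above. The dictionary keys are the
# keyword arguments accepted by psycopg2.connect; the mapping is illustrative.
REQUIRED_VARS = {
    "host": "POSTGRES_HOST",
    "port": "POSTGRES_PORT",
    "dbname": "POSTGRES_DB",
    "user": "POSTGRES_USER",
    "password": "POSTGRES_PASSWORD",
}


def postgres_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect PostgreSQL connection settings, failing early on missing values."""
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

In a script, one would typically call `load_dotenv()` from `python-dotenv` first so the `.env` values are present in `os.environ`, then pass the result to `psycopg2.connect(**postgres_settings())`.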
---
## Python Dependencies
This project uses a `requirements.txt` file to manage all Python dependencies needed for the data pipeline, including Airflow, DBT, PostgreSQL, testing, and development tools.
### What's Included
| Category | Package(s) | Purpose |
| -------- | ---------- | ------- |
| **DBT** | `dbt-core`, `dbt-postgres`, `dbt-bigquery` | DBT functionality for PostgreSQL and BigQuery |
| **Airflow** | `apache-airflow==2.9.1`, `apache-airflow-providers-openlineage` | Workflow orchestration |
| **Database** | `psycopg2-binary==2.9.9` | PostgreSQL connector used by Airflow and DBT |
| **Compatibility** | `protobuf<5`, `sqlparse<0.5` | Ensures compatibility with DBT and Airflow |
| **Environment Variables** | `python-dotenv==1.1.0` | Loads `.env` files for secure and flexible config |
| **Synthetic Data** | `faker==24.9.0` | Generate fake data for testing or mock pipelines |
| **Testing** | `pytest`, `pytest-mock` | Unit testing and mocking for pipeline components |
| **Code Quality** | `black`, `flake8`, `isort` | Code formatting, linting, and import sorting |
| **Data Analysis** | `numpy`, `pandas`, `matplotlib` | Analyze, transform, and visualize data in Python |
| **Google Sheets Integration** | `gspread`, `gspread-dataframe`, `oauth2client` | Interact with Google Sheets via API |
### Docker Compose Setup
A sample `docker-compose.yml` is included in the repo and features:
- PostgreSQL with persistent volume
- Airflow Webserver, Scheduler, Init
- Custom Dockerfile for Airflow + DBT + Python deps
> Make sure the `volumes:` entries in each service are properly mapped to `./dags`, `./scripts`, etc.
### Dockerfile.airflow
This file sets up the Airflow environment with Python dependencies and your DBT project.
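```dockerfile
FROM apache/airflow:2.10.0-python3.11

# Install build tools as root (needed to compile some Python packages)
USER root
RUN apt-get update && apt-get install -y build-essential git

# Switch back to the airflow user before installing Python dependencies
USER airflow
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```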
---
## Helper Scripts
To simplify setup and maintenance, the project includes the following Bash scripts:
| Script | Description |
| ---------------------- | ------------------------------------------ |
| `1_init.sh` | Initialize Airflow DB, create admin user |
| `2_reset_docker.sh` | Reset all containers, volumes, and rebuild |
| `3_fix_permissions.sh` | Fix volume permissions (Linux only) |
---
## Access the Airflow Web UI
Once the containers are up and the initialization step has been completed, you can access the **Apache Airflow** web interface to monitor, manage, and trigger your DAGs.
### Open in your browser
[localhost:8080](http://localhost:8080)
This URL points to the **Airflow webserver** running inside the Docker container and exposed on your local machine's port `8080`.
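Besides opening the UI, the webserver can be probed from a script. Airflow 2 exposes a `/health` endpoint on the webserver that returns a JSON document with a status per component. The sketch below assumes the default `http://localhost:8080` address mentioned above; the function names are illustrative.

```python
import json
from urllib.request import urlopen


def unhealthy_components(health: dict) -> list[str]:
    """Return the names of components whose status is not "healthy"."""
    return [
        name
        for name, info in health.items()
        if isinstance(info, dict) and info.get("status") != "healthy"
    ]


def fetch_health(base_url: str = "http://localhost:8080") -> dict:
    """Fetch the JSON document served by the webserver's /health endpoint."""
    with urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)
```

Calling `unhealthy_components(fetch_health())` after the containers start should return an empty list when everything is up. Components that report no status, such as one that is not deployed, are listed as well, so filter the result to the services you actually run.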
### Default login credentials
If you used the initialization script (`./1_init.sh`), the following admin user was created automatically:
```
Username: admin
Password: admin
```
> You can customize these credentials by modifying the `airflow users create` command inside the `airflow-init` service or the `1_init.sh` script.
### What you'll see
After logging in, you'll be able to:
- View all DAGs in the `dags/` folder
- Trigger DAGs manually or wait for scheduled runs
- Monitor task statuses and inspect logs
- Manage Airflow Connections, Variables, and Pools
- Access admin configurations and user management
### Troubleshooting DAGs
If DAGs don't appear:
- Check that the `dags/*.py` files define a DAG object.
- Run `docker compose logs -f airflow-webserver` to inspect the webserver logs.
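A related cause worth ruling out: with its default safe-mode discovery, Airflow 2 only parses Python files whose text contains both `airflow` and `dag` (case-insensitive), so a file missing either string is silently skipped. The sketch below approximates that heuristic with the standard library; the function name is hypothetical and the `dags` default path is taken from this repository's layout.

```python
from pathlib import Path


def files_skipped_by_safe_mode(dags_dir: str = "dags") -> list[str]:
    """List .py files lacking the strings "airflow" or "dag" (case-insensitive)."""
    skipped = []
    for path in sorted(Path(dags_dir).glob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if "airflow" not in text or "dag" not in text:
            skipped.append(path.name)
    return skipped
```

An empty result does not prove a file defines a valid DAG; it only rules out this particular reason for a DAG not being listed.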
---
## What's Next?
This pipeline is ready for:
- [X] Building DAGs with Python and Airflow
- [X] Exporting data to CSV or Google Sheets
- [X] Connecting to BigQuery
- [ ] Creating visualizations
- [ ] Modeling datasets with DBT and version control
---
## Feedback or Questions?
Feel free to open an issue or submit a PR!