{"id":29961485,"url":"https://github.com/camilajaviera91/gcp-new","last_synced_at":"2026-04-20T13:07:50.776Z","repository":{"id":305801727,"uuid":"1023263054","full_name":"CamilaJaviera91/gcp-new","owner":"CamilaJaviera91","description":"This project defines a modern data pipeline architecture using Airflow, DBT, and PostgreSQL. Below you'll find instructions on how to get started and how the repository is structured.","archived":false,"fork":false,"pushed_at":"2025-07-29T21:46:50.000Z","size":169,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-29T23:03:09.024Z","etag":null,"topics":["airflow","airflow-dags","bashoperator","docker-compose","dotenv","os","pandas","pipelines","psycopg2","pythonoperator"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CamilaJaviera91.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-20T21:42:06.000Z","updated_at":"2025-07-29T21:46:53.000Z","dependencies_parsed_at":"2025-07-22T05:25:02.383Z","dependency_job_id":"4d83ea30-b5f0-4bed-958a-619041e46f6d","html_url":"https://github.com/CamilaJaviera91/gcp-new","commit_stats":null,"previous_names":["camilajaviera91/gcp-new"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CamilaJaviera91/gcp-new","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamilaJaviera91%2Fgcp-new","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ca
milaJaviera91%2Fgcp-new/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamilaJaviera91%2Fgcp-new/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamilaJaviera91%2Fgcp-new/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CamilaJaviera91","download_url":"https://codeload.github.com/CamilaJaviera91/gcp-new/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CamilaJaviera91%2Fgcp-new/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32048474,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","airflow-dags","bashoperator","docker-compose","dotenv","os","pandas","pipelines","psycopg2","pythonoperator"],"created_at":"2025-08-03T23:10:59.022Z","updated_at":"2026-04-20T13:07:50.757Z","avatar_url":"https://github.com/CamilaJaviera91.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌀 Airflow + DBT + PostgreSQL Data Pipeline\n\nThis repository implements a modern, modular data pipeline using:\n\n- Apache Airflow for orchestration\n\n- DBT for SQL-based 
transformations\n\n- PostgreSQL as both the source/target database and metadata store\n\n\u003e 💡 Ideal for learning, development, and lightweight data integration projects.\n\n---\n\n## 📂 Project Structure\n\n```\n.\n├── 1_init.sh\n├── 2_reset_docker.sh\n├── 3_fix_permissions.sh\n├── credentials #gitignore\n│   └── auth.json\n├── dags\n│   └── dag.py\n├── dbt_project\n│   ├── dbt_project.yml\n│   ├── models\n│   │   ├── marts\n│   │   │   ├── final_report.sql\n│   │   │   └── sales_by_product.sql\n│   │   ├── schema.sql\n│   │   └── staging\n│   │       ├── clients.sql\n│   │       ├── orders.sql\n│   │       └── products.sql\n│   └── profiles.yml\n├── docker-compose.yml\n├── Dockerfile.airflow\n├── files\n│   ├── clients.csv\n│   ├── final_report.csv\n│   ├── orders.csv\n│   ├── products.csv\n│   └── sales_by_product.csv\n├── LICENSE\n├── README.md\n├── requirements.txt\n└── scripts\n    ├── extract\n    │   └── extract.py\n    ├── load\n    │   └── load_data.py\n    └── utils\n        └── utils.py\n```\n\n---\n\n## 🚀 Getting Started\n\nBefore running the pipeline, make sure to create the following folders in the root directory of the project:\n\n```\n.\n├── dags/               # Airflow DAG definitions\n├── dbt_project/        # DBT transformations and config\n│   └── models/\n│       ├── staging/    # Raw → Staging transformations\n│       └── marts/      # Staging → Marts (analytics-ready)\n├── files/              # CSVs, exports, mock datasets\n├── scripts/            # Python utilities for extract/load/validation\n│   ├── extract/\n│   ├── load/\n│   └── utils/\n```\n\n- `dags/`: Contains Airflow DAGs to orchestrate the pipeline.\n- `dbt_project/`: Contains the DBT project with all SQL transformation models.\n  - `models/`\n    - `staging/`: Contains staging models for cleaning and preparing raw data.\n    - `marts/`: Contains data marts for final models ready for analysis and reporting.\n- `files/`: Stores input/output files such as CSVs.\n- `scripts/`: 
Includes helper scripts for data extraction, validation, and loading.\n\n---\n\n## 🛠️ Prerequisites\n\nMake sure you have the following installed:\n\n- Python 3.10+\n- Docker \u0026 Docker Compose\n- DBT\n- Apache Airflow (v2+)\n- PostgreSQL\n\n---\n\n## 📦 Installation\n\n```bash\n# Clone the repository\ngit clone git@github.com:CamilaJaviera91/gcp-new.git\ncd gcp-new\n\n# Create required folders\nmkdir -p dags dbt_project/models/{staging,marts} files scripts/{extract,load,utils}\n```\n\n---\n\n## 🐳 Docker Setup with Airflow and PostgreSQL\n\nThis project uses Docker Compose to orchestrate the following services:\n\n| Service               | Description                                    |\n| --------------------- | ---------------------------------------------- |\n| **PostgreSQL**        | Stores raw/transformed data \u0026 Airflow metadata |\n| **Airflow Webserver** | UI to manage DAGs                              |\n| **Airflow Scheduler** | Triggers DAG tasks based on time or sensors    |\n| **Airflow Init**      | Initializes metadata DB, creates user          |\n\n\u003e Make sure the folder structure described above exists before launching the containers.\n\n---\n\n## ⚙️ .env Configuration\n\nCreate a `.env` file with the following variables (sample):\n\n```\n# Airflow\nAIRFLOW__CORE__EXECUTOR=...\nAIRFLOW__CORE__LOAD_EXAMPLES=...\nAIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...\nAIRFLOW__WEBSERVER__SECRET_KEY=...\n\n# PostgreSQL\nPOSTGRES_SCHEMA=...\nPOSTGRES_HOST=...\nPOSTGRES_PORT=...\nPOSTGRES_DB=...\nPOSTGRES_USER=...\nPOSTGRES_PASSWORD=...\n\n# BigQuery\nGOOGLE_CREDENTIALS_PATH=...\nBQ_PROJECT_ID=...\nBQ_DATASET=...\n\n```\n\n---\n\n## 📦 Python Dependencies\n\nThis project uses a `requirements.txt` file to manage all Python dependencies needed for the data pipeline, including Airflow, DBT, PostgreSQL, testing, and development tools.\n\n### 🔧 What's Included\n\n| Category | Package(s) | Purpose |\n| -------- | ---------- | ------- |\n| **DBT** | `dbt-core`, `dbt-postgres`, 
`dbt-bigquery` | DBT functionality for PostgreSQL and BigQuery |\n| **Airflow** | `apache-airflow==2.9.1`, `apache-airflow-providers-openlineage` | Workflow orchestration |\n| **Database** | `psycopg2-binary==2.9.9` | PostgreSQL connector used by Airflow and DBT |\n| **Compatibility** | `protobuf\u003c5`, `sqlparse\u003c0.5` | Ensures compatibility with DBT and Airflow |\n| **Environment Variables** | `python-dotenv==1.1.0` | Loads `.env` files for secure and flexible config |\n| **Synthetic Data** | `faker==24.9.0` | Generates fake data for testing or mock pipelines |\n| **Testing** | `pytest`, `pytest-mock` | Unit testing and mocking for pipeline components |\n| **Code Quality** | `black`, `flake8`, `isort` | Code formatting, linting, and import sorting |\n| **Data Analysis** | `numpy`, `pandas`, `matplotlib` | Analyze, transform, and visualize data in Python |\n| **Google Sheets Integration** | `gspread`, `gspread-dataframe`, `oauth2client` | Interact with Google Sheets via API |\n\n### 🛠️ Docker Compose Setup\n\nA sample `docker-compose.yml` is included in the repo and features:\n\n- PostgreSQL with persistent volume\n\n- Airflow Webserver, Scheduler, Init\n\n- Custom Dockerfile for Airflow + DBT + Python deps\n\n\u003e ✅ Make sure the `volumes:` entries in each service are properly mapped to `./dags`, `./scripts`, etc.\n\n### 🌀 Dockerfile.airflow\n\nThis file builds the Airflow image and installs the Python dependencies from `requirements.txt`, including DBT.\n\n```\nFROM apache/airflow:2.10.0-python3.11\n\nUSER root\nRUN apt-get update \u0026\u0026 apt-get install -y build-essential git\n\nUSER airflow\n\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\n```\n\n---\n\n## ⚙️ Helper Scripts\n\nTo simplify setup and maintenance, the project includes the following Bash scripts:\n\n| Script                 | Description                                |\n| ---------------------- | ------------------------------------------ |\n| `1_init.sh`            | Initialize Airflow DB, create 
admin user   |\n| `2_reset_docker.sh`    | Reset all containers, volumes, and rebuild |\n| `3_fix_permissions.sh` | Fix volume permissions (Linux only)        |\n\n---\n\n## 🌐 Access the Airflow Web UI\n\nOnce the containers are up and the initialization step has been completed, you can access the **Apache Airflow** web interface to monitor, manage, and trigger your DAGs.\n\n### 🔗 Open in your browser\n\n[localhost:8080](http://localhost:8080)\n\nThis URL points to the **Airflow webserver** running inside the Docker container and exposed on your local machine's port `8080`.\n\n### 🔐 Default login credentials\n\nIf you used the initialization script (`./1_init.sh`), the following admin user was created automatically:\n\n```\nUsername: admin\nPassword: admin\n```\n\n\u003e 💡 You can customize these credentials by modifying the `airflow users create` command inside the `airflow-init` service or the `1_init.sh` script.\n\n### 🖥️ What you’ll see\n\nAfter logging in, you’ll be able to:\n\n- View all DAGs in the `dags/` folder\n- Trigger DAGs manually or wait for scheduled runs\n- Monitor task statuses and inspect logs\n- Manage Airflow Connections, Variables, and Pools\n- Access admin configurations and user management\n\n### 🛠️ Troubleshooting DAGs\nIf DAGs don't appear:\n\n- Check that the `dags/*.py` files define a DAG object\n\n- Run `docker compose logs -f airflow-webserver` to inspect the webserver logs\n\n---\n\n## 📈 What’s Next?\nThis pipeline is ready for:\n\n- [X] 💡 Building DAGs with Python and Airflow\n- [X] 📤 Exporting data to CSV or Google Sheets\n- [X] 🔗 Connecting to BigQuery\n- [ ] 📊 Creating visualizations\n- [ ] 🧠 Modeling datasets with DBT and version control\n\n---\n\n## 📬 Feedback or Questions?\nFeel free to open an issue or submit a 
PR!","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamilajaviera91%2Fgcp-new","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamilajaviera91%2Fgcp-new","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamilajaviera91%2Fgcp-new/lists"}