{"id":26893557,"url":"https://github.com/gades-dataeng/webinar","last_synced_at":"2025-03-31T23:58:21.605Z","repository":{"id":282963437,"uuid":"923012524","full_name":"GADES-DATAENG/webinar","owner":"GADES-DATAENG","description":"Code, scripts, and resources for the Data Engineering Fundamentals Course Webinar, covering Python, data pipelines, Apache Airflow, and more.","archived":false,"fork":false,"pushed_at":"2025-01-27T14:37:22.000Z","size":28256,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-17T22:41:52.375Z","etag":null,"topics":["apache-airflow","data-engineering","data-orchestration","data-orchestrator","data-pipelines","dimensional-modeling","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GADES-DATAENG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-27T13:50:25.000Z","updated_at":"2025-02-21T10:34:03.000Z","dependencies_parsed_at":"2025-03-17T22:51:56.520Z","dependency_job_id":null,"html_url":"https://github.com/GADES-DATAENG/webinar","commit_stats":null,"previous_names":["gades-dataeng/webinar"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GADES-DATAENG%2Fwebinar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GADES-DATAENG%2Fwebinar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GADES-DATAENG%2Fwebinar/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GADES-DATAENG%2Fwebinar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GADES-DATAENG","download_url":"https://codeload.github.com/GADES-DATAENG/webinar/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246558113,"owners_count":20796696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","data-engineering","data-orchestration","data-orchestrator","data-pipelines","dimensional-modeling","python","sql"],"created_at":"2025-03-31T23:58:21.146Z","updated_at":"2025-03-31T23:58:21.597Z","avatar_url":"https://github.com/GADES-DATAENG.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PySQLshop Data Repository\n\nThis repository contains the code necessary to build the data pipeline for **PySQLshop**, a fictional store. The code is designed for educational purposes and demonstrates how to set up and manage a data pipeline using **Airflow** and **DBT (Data Build Tool)** for data transformations with **BigQuery** as the target.\n\nIn this project, we use Docker to containerize Airflow, DBT, and related services. The DBT process uses the `dbt-bigquery` adapter to connect to Google BigQuery for data transformation tasks.\n\n## Prerequisites\n\nBefore running this project, you need to have the following prerequisites:\n\n1. **Docker** installed on your machine to build and run the containers.\n2. 
**Docker Compose** installed to manage multi-container Docker applications.\n3. **Google Cloud Service Account** with access to BigQuery, and the corresponding **JSON key file**.\n\n   - Create a service account in Google Cloud with access to BigQuery.\n   - Download the service account key JSON file and place it in the `keys` folder of this repository as `gcp-key.json`.\n\n## Setup Instructions\n\n### Step 1: Clone the Repository\n\nIf you haven't already cloned the repository, you can do so by running the following command:\n\n```bash\ngit clone git@github.com:GADES-DATAENG/webinar.git\ncd webinar\n```\n\n### Step 2: Create your .env file\nBefore starting the services, you need to create a `.env` file with the required variables. Use the `.env.template` file as a starting point and fill in your own values.\n```bash\ncp .env.template .env\n```\n\n### Step 3: Get your GCP service account JSON credentials file\nAfter downloading your GCP service account JSON credentials file, place it in the `keys` folder with the name `gcp-key.json`.\n\n### Step 4: Build the Docker Image for DBT\nBefore starting the services, you need to build the DBT Docker image. Run the following command inside the repository folder:\n```bash\ndocker build -t dbt-core .\n```\n\nThis will build the `dbt-core` image based on the `Dockerfile` in the repository.\n\n### Step 5: Start the Services with Docker Compose\nOnce the image is built, you can start the services (Airflow, DBT, and other dependencies) using Docker Compose. Run the following command:\n```bash\ndocker-compose up -d\n```\n\nThis command will start all the containers defined in the `docker-compose.yml` file. 
It will set up Airflow, DBT, and any necessary services, including BigQuery integration.\n\n### Step 6: Access the Services\n- **Airflow Web UI**: You can access the Airflow web interface at http://localhost:8080\n    - Default login credentials:\n        - **Username**: `airflow`\n        - **Password**: `airflow`\n- **DBT**: The DBT transformations run inside the DBT container, triggered by the Airflow DAG\n\n## Environment Setup\n- The DBT container uses `dbt-bigquery` to interact with Google BigQuery\n- The service account key file (`gcp-key.json`) should be inside the `keys` folder. DBT will use this file to authenticate and interact with BigQuery\n\nEnsure that the key file is placed in the `keys` folder of the repository as:\n```bash\n/webinar/keys/gcp-key.json\n```\n\n### Step 7: Running the DAG\nAirflow will trigger the DBT transformations according to the defined DAGs. You can monitor the progress of your tasks in the Airflow UI and check the task logs to confirm success or diagnose failures.\n\n## Example Command to Run DBT Manually (if needed)\nIf you need to run DBT manually inside the container, you can use the following command:\n```bash\ndocker exec -it dbt-container dbt run\n```\nThis will execute the DBT transformations inside the running container.\n\n## Directory Structure\n- `/dags`: Contains the Airflow DAGs that control the data pipeline.\n- `/dbt`: Contains DBT models and configuration.\n- `docker-compose.yml`: The Docker Compose configuration to run the services.\n- `Dockerfile`: The Dockerfile for building the DBT container.\n- `/keys/gcp-key.json`: Your Google Cloud service account JSON key file (not included in the repo for security reasons).\n\n## License\nThis project is for educational purposes. 
Please do not use it in production without proper security and configuration updates.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgades-dataeng%2Fwebinar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgades-dataeng%2Fwebinar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgades-dataeng%2Fwebinar/lists"}