https://github.com/markphamm/data_pipeline_cicd
Automated CI/CD pipeline for data workflows using GitHub Actions, enabling version control, testing, and deployment of data pipelines across environments.
- Host: GitHub
- URL: https://github.com/markphamm/data_pipeline_cicd
- Owner: MarkPhamm
- License: mit
- Created: 2025-04-21T15:02:59.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-05-04T18:47:55.000Z (about 1 month ago)
- Last Synced: 2025-05-04T19:35:25.467Z (about 1 month ago)
- Language: Python
- Size: 1.65 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🚀 Data Pipeline CI/CD with GitHub Actions
This project demonstrates how to build and automate a CI/CD pipeline for data workflows using **GitHub Actions**. With the increasing complexity of data pipelines, automation ensures reliability, reproducibility, and scalability across development, staging, and production environments.
---
## ⚙️ Automating Data Pipelines
There are two key strategies for automating data pipelines: using orchestration platforms and setting up custom scripts with time-based triggers.
### 🧭 1. Orchestration Tools
Modern orchestration platforms help schedule, monitor, and manage pipeline dependencies:
- **Apache Airflow** – Industry standard for DAG-based workflow orchestration.
- **Mage** – A low-code alternative focused on simplicity and rapid development.
- **Dagster** – A data-first orchestrator with solid development tooling and type safety.
- **Astronomer (Astro)** – A commercial platform built on Airflow with managed infrastructure and deployment capabilities.

Each tool can schedule ETL jobs, trigger workflows based on events, and monitor pipeline health.
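To make the orchestration idea concrete, here is a minimal sketch of a daily ETL DAG using Airflow's TaskFlow API (assuming Airflow 2.4+; the task bodies are hypothetical placeholders, not part of this repo):

```python
# A hypothetical daily ETL DAG using Airflow's TaskFlow API (Airflow 2.4+).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def etl_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull raw records from a source system.
        return [{"id": 1, "text": " Hello World "}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Placeholder: clean and normalize the records.
        return [{**r, "text": r["text"].strip().lower()} for r in records]

    @task
    def load(records: list[dict]) -> None:
        # Placeholder: persist the results.
        print(f"loaded {len(records)} records")

    load(transform(extract()))


etl_pipeline()
```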
---
### 🐍 2. Python + Cron Triggers
For lightweight workflows or early-stage projects, Python scripts combined with `cron` jobs are effective.
- Write your ETL logic in a Python script
- Schedule it using cron syntax (e.g., `0 0 * * *` to run daily at midnight); [crontab.guru](https://crontab.guru) helps build cron expressions
- GitHub Actions can act as a managed cron runner without relying on local schedulers or cloud platforms (see the sketch after this list)
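As a minimal sketch of this pattern (the script name, endpoint, and output file are hypothetical):

```python
# etl.py – a hypothetical cron-friendly ETL script using only the standard library.
# Example crontab entry (runs daily at midnight):
#   0 0 * * * /usr/bin/python3 /opt/pipelines/etl.py >> /var/log/etl.log 2>&1
import csv
import json
import urllib.request
from pathlib import Path


def extract(url: str) -> list[dict]:
    # Ingest: fetch raw JSON records from an HTTP endpoint.
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def transform(records: list[dict]) -> list[dict]:
    # Keep only the fields downstream consumers need.
    return [{"id": r["id"], "title": r.get("title", "").strip()} for r in records]


def load(records: list[dict], out_path: Path) -> None:
    # Write the cleaned records as CSV.
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title"])
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    rows = transform(extract("https://example.com/api/items"))  # hypothetical endpoint
    load(rows, Path("items.csv"))
```
---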
## 🧩 GitHub Actions for CI/CD
GitHub Actions is a native automation platform that allows you to run scripts on GitHub events (push, PR, schedule, etc.).
**Benefits:**
- ⚡ **Free compute** for public repositories and generous limits for private ones
- 🛠️ **Easy to set up** – just create a `.yml` file under `.github/workflows/`
- 🧪 **Test-driven development** – run tests on push, PR, or commit
- 🔐 **Secure with secrets** – manage API keys and tokens in GitHub Settings
---
## 🧪 Example: Automating ETL for YouTube Video Transcripts → Text Embeddings
This end-to-end use case shows how to automate the entire workflow from ingestion to transformation using GitHub Actions.
### 🔄 Workflow Overview
1. **Create ETL Python Script**
   - Ingest YouTube video metadata and transcripts
   - Clean and preprocess data
   - Convert text into vector embeddings using OpenAI or HuggingFace models (a sketch of such a script follows this list)
2. **Create GitHub Repository**
   - Push your ETL logic and project structure
3. **Set Up GitHub Actions Workflow**
   - Create a YAML file like `.github/workflows/etl.yml`
   - Define triggers (`on: push`, `on: schedule`, etc.)
   - Define job steps (install dependencies, run the script)
4. **Add Repo Secrets**
   - Store API keys like `OPENAI_API_KEY`, `YOUTUBE_API_KEY` using GitHub Secrets
   - Reference them in your workflow for secure access
5. **Push and Commit**
   - The workflow is triggered automatically
   - Your data pipeline runs in the cloud on every change or on a schedule
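As a rough, hypothetical sketch of such a script (assuming the `youtube-transcript-api` pre-1.0 interface and `sentence-transformers`; the video IDs, model name, and output file are placeholders, not the repo's actual implementation):

```python
# data_pipeline.py – a hypothetical sketch of the ETL steps above, assuming
# youtube-transcript-api (pre-1.0 interface) and sentence-transformers.
import json
from pathlib import Path

from sentence_transformers import SentenceTransformer
from youtube_transcript_api import YouTubeTranscriptApi

VIDEO_IDS = ["dQw4w9WgXcQ"]  # placeholder video IDs


def fetch_transcript(video_id: str) -> str:
    # Ingest: pull the transcript segments and join their text.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)


def clean(text: str) -> str:
    # Preprocess: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()


def main() -> None:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open embedding model
    index = []
    for vid in VIDEO_IDS:
        text = clean(fetch_transcript(vid))
        index.append({"video_id": vid, "embedding": model.encode(text).tolist()})
    # Write the index so the workflow's commit step can pick it up.
    Path("video_index.json").write_text(json.dumps(index))


if __name__ == "__main__":
    main()
```
---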
## 📄 Example Workflow File (`etl.yml`)
```yaml
name: data-pipeline-workflow

on:
  # push: # uncomment to also run on every push
  schedule:
    - cron: "0 0 * * 1" # run every Monday at 12:00 AM (midnight) UTC
  workflow_dispatch: # allow manual triggers from the Actions tab

jobs:
  run-data-pipeline:
    runs-on: ubuntu-latest # the runner OS for the job
    steps:
      - name: Checkout repo content
        uses: actions/checkout@v4 # pull all code from the GitHub repo
        with:
          token: ${{ secrets.PERSONAL_ACCESS_TOKEN }} # use a PAT instead of the default GITHUB_TOKEN

      - name: Setup python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run data pipeline
        env:
          YT_API_KEY: ${{ secrets.YT_API_KEY }} # inject the API key as an env variable
        run: python data_pipeline.py # run the data pipeline

      - name: Check for changes # set an env variable indicating whether new data was produced
        id: git-check
        run: |
          git config user.name 'github-actions'
          git config user.email '[email protected]'
          git add .
          git diff --staged --quiet || echo "changes=true" >> $GITHUB_ENV

      - name: Commit and push if changes
        if: env.changes == 'true' # only commit and push when new data was produced
        run: |
          git commit -m "updated video index"
          git push
```
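The `Check for changes` step stages everything and writes `changes=true` into `$GITHUB_ENV` only when `git diff --staged --quiet` finds staged changes, so the final step commits and pushes only when the pipeline actually produced new data.

On the Python side, the script picks up the secret injected through `env:`. A minimal sketch (the fail-fast check is an illustration, not from the repo):

```python
# Inside data_pipeline.py: read the API key the workflow injects via `env:`.
import os

yt_api_key = os.environ.get("YT_API_KEY")
if not yt_api_key:
    # Fail fast with a clear message when the secret is missing,
    # e.g. when running locally without exporting YT_API_KEY.
    raise RuntimeError("YT_API_KEY is not set; add it under Settings > Secrets")
```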