{"id":28660466,"url":"https://github.com/datatweets/airflow-pyspark-k8s","last_synced_at":"2026-05-20T05:01:49.924Z","repository":{"id":295963339,"uuid":"991781623","full_name":"datatweets/airflow-pyspark-k8s","owner":"datatweets","description":"Run Apache Airflow with KubernetesExecutor and PySpark on Kubernetes using Helm charts and Kind for local development","archived":false,"fork":false,"pushed_at":"2025-06-04T15:01:13.000Z","size":384,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-04T21:07:00.392Z","etag":null,"topics":["airflow","airflow-dags","apache-spark","data-engineering","data-pipelines","kubernetes-deployment","python"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datatweets.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-28T06:27:50.000Z","updated_at":"2025-06-04T14:13:05.000Z","dependencies_parsed_at":"2025-05-28T09:33:56.804Z","dependency_job_id":null,"html_url":"https://github.com/datatweets/airflow-pyspark-k8s","commit_stats":null,"previous_names":["datatweets/airflow-pyspark-k8s"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/datatweets/airflow-pyspark-k8s","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatweets%2Fairflow-pyspark-k8s","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatweets%2Fairflow-pyspark-k8s/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatweets%2Fairflow-pyspark-k8s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatweets%2Fairflow-pyspark-k8s/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datatweets","download_url":"https://codeload.github.com/datatweets/airflow-pyspark-k8s/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatweets%2Fairflow-pyspark-k8s/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259634172,"owners_count":22887688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","airflow-dags","apache-spark","data-engineering","data-pipelines","kubernetes-deployment","python"],"created_at":"2025-06-13T11:00:33.426Z","updated_at":"2025-10-07T01:11:01.725Z","avatar_url":"https://github.com/datatweets.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Airflow + PySpark on Kubernetes\n\n[![Apache Airflow](https://img.shields.io/badge/Apache%20Airflow-2.x-blue.svg)](https://airflow.apache.org/) [![Kubernetes](https://img.shields.io/badge/Kubernetes-1.21+-326ce5.svg)](https://kubernetes.io/) [![PySpark](https://img.shields.io/badge/PySpark-3.x-E25A1C.svg)](https://spark.apache.org/) [![Helm](https://img.shields.io/badge/Helm-3.x-0F1689.svg)](https://helm.sh/)\n\nA production-ready infrastructure setup for running Apache Airflow with PySpark on Kubernetes using Helm charts and Kind for local development.\n\n## Features\n\n- **Apache Airflow** with KubernetesExecutor for dynamic pod scaling\n- **PySpark** integration for distributed big-data processing\n- **PostgreSQL** as the metadata database\n- **Helm** charts for repeatable, configurable deployments\n- **Kind** (Kubernetes in Docker) for local development\n- **Sample DAGs** demonstrating ETL workflows\n- **RBAC** configured for secure Kubernetes operations\n- **Persistent volumes** for DAGs, logs, and plugins\n\n## Table of Contents\n\n- [Architecture](#architecture)\n- [Prerequisites](#prerequisites)\n- [Quick Start](#quick-start)\n- [Port Configuration](#port-configuration)\n- [Accessing Airflow UI](#accessing-airflow-ui)\n- [DAGs \u0026 Examples](#dags--examples)\n- [Development Workflow](#development-workflow)\n- [Configuration](#configuration)\n- [Troubleshooting](#troubleshooting)\n- [License](#license)\n\n## Architecture\n\n```mermaid\ngraph TB\n    subgraph \"Local Development Environment\"\n        LM[Local Machine\u003cbr/\u003e- DAGs\u003cbr/\u003e- Scripts\u003cbr/\u003e- Logs]\n    end\n    \n    subgraph \"Kind Cluster (Kubernetes)\"\n        subgraph \"Airflow Namespace\"\n            subgraph \"Core Services\"\n                WS[Airflow Webserver\u003cbr/\u003ePort: 8080\u003cbr/\u003eNodePort: 30080]\n                SCH[Airflow Scheduler\u003cbr/\u003eKubernetesExecutor]\n                DB[(PostgreSQL\u003cbr/\u003eMetadata DB)]\n            end\n            \n            subgraph \"Storage Layer\"\n                PVC1[DAGs PVC]\n                PVC2[Logs PVC]\n                PVC3[Scripts PVC]\n            end\n            \n            subgraph \"Dynamic Task Execution\"\n                subgraph \"Regular Tasks\"\n                    WT1[Worker Pod 1\u003cbr/\u003ePython Operator]\n                    WT2[Worker Pod 2\u003cbr/\u003eBash Operator]\n                end\n                \n                subgraph \"Spark Tasks\"\n                    SD[Spark Driver Pod\u003cbr/\u003eSparkSubmitOperator]\n                    SE1[Spark Executor 1]\n                    SE2[Spark Executor 2]\n                    SE3[Spark Executor N]\n                end\n            end\n        end\n    end\n    \n    subgraph \"External Access\"\n        USER[User/Developer]\n    end\n    \n    %% Connections\n    USER --\u003e|HTTP :30080| WS\n    WS \u003c--\u003e|REST API| SCH\n    SCH --\u003e|Task Status| DB\n    WS --\u003e|Query| DB\n    \n    SCH --\u003e|Create Pod| WT1\n    SCH --\u003e|Create Pod| WT2\n    SCH --\u003e|Create Pod| SD\n    \n    SD --\u003e|Manage| SE1\n    SD --\u003e|Manage| SE2\n    SD --\u003e|Manage| SE3\n    \n    LM -.-\u003e|Volume Mount| PVC1\n    LM -.-\u003e|Volume Mount| PVC2\n    LM -.-\u003e|Volume Mount| PVC3\n    \n    PVC1 --\u003e|Mount /opt/airflow/dags| WS\n    PVC1 --\u003e|Mount /opt/airflow/dags| SCH\n    PVC1 --\u003e|Mount /opt/airflow/dags| WT1\n    PVC1 --\u003e|Mount /opt/airflow/dags| WT2\n    \n    PVC3 --\u003e|Mount /opt/airflow/scripts| SD\n    PVC2 --\u003e|Mount /opt/airflow/logs| WS\n    PVC2 --\u003e|Mount /opt/airflow/logs| WT1\n    PVC2 --\u003e|Mount /opt/airflow/logs| WT2\n    PVC2 --\u003e|Mount /opt/airflow/logs| SD\n    \n    classDef core fill:#e1f5fe,stroke:#01579b,stroke-width:2px\n    classDef storage fill:#f3e5f5,stroke:#4a148c,stroke-width:2px\n    classDef worker fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px\n    classDef spark fill:#fff3e0,stroke:#e65100,stroke-width:2px\n    classDef external fill:#fce4ec,stroke:#880e4f,stroke-width:2px\n    \n    class WS,SCH,DB core\n    class PVC1,PVC2,PVC3 storage\n    class WT1,WT2 worker\n    class SD,SE1,SE2,SE3 spark\n    class USER,LM external\n```\n\n## Repository Structure\n\n```\n.\n├── dags/                          # Airflow DAG definitions\n│   ├── hello_world_dag.py         # Basic example DAG\n│   ├── one_task_dag.py            # Single task example\n│   ├── spark_wordcount.py         # PySpark integration example\n│   └── test_kubernetes_executor.py # Kubernetes executor test\n├── k8s/                           # Kubernetes manifests\n│   └── rbac.yaml                  # RBAC permissions\n├── scripts/                       # PySpark applications and data\n│   ├── sample_text.txt            # Sample data for wordcount\n│   └── wordcount.py               # PySpark wordcount script\n├── templates/                     # Helm chart templates\n│   ├── airflow-configmap.yaml\n│   ├── airflow-init-job.yaml\n│   ├── airflow-scheduler-deployment.yaml\n│   ├── airflow-webserver-deployment.yaml\n│   ├── airflow-webserver-service.yaml\n│   ├── postgresql-deployment.yaml\n│   ├── postgresql-pvc.yaml\n│   ├── postgresql-service.yaml\n│   └── worker-pod-template-configmap.yaml\n├── .gitignore\n├── .helmignore\n├── Chart.yaml                     # Helm chart metadata\n├── README.md                      # This file\n├── kind-config.yaml               # Kind cluster configuration\n└── values.yaml                    # Helm value overrides\n```\n\n## Prerequisites\n\n### Required Tools\n\n| Tool    | Version | Installation Guide                                           |\n| ------- | ------- | ------------------------------------------------------------ |\n| Docker  | 20.10+  | [Install Docker Desktop](https://www.docker.com/products/docker-desktop) |\n| Kind    | 0.11+   | [Install Kind](https://kind.sigs.k8s.io/docs/user/quick-start/) |\n| kubectl | 1.21+   | [Install kubectl](https://kubernetes.io/docs/tasks/tools/)   |\n| Helm    | 3.0+    | [Install Helm](https://helm.sh/docs/intro/install/)          |\n| Git     | 2.0+    | [Install Git](https://git-scm.com/downloads)                 |\n\n\n### Verify Installation\n\n```bash\n# Check all tools are installed\ndocker --version\nkind --version\nkubectl version --client\nhelm version\ngit --version\n```\n---\n### NOTE\n#### Additional Development Tools\n\nFor a complete development environment, you may also want to install:\n- **Python 3.8+** - Required for local DAG development and testing\n- **Visual Studio Code** - Recommended IDE with excellent Python and Kubernetes support\n\n#### Installation Resources\n**Comprehensive setup guides available at:** [DataTweets Documentation](https://datatweets.com/docs/reference/)\n\nThe documentation includes step-by-step instructions for:\n- Python installation across different operating systems\n- Visual Studio Code setup and configuration\n\n---\n\n## Quick Start\n\n### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/datatweets/airflow-pyspark-k8s.git\ncd airflow-pyspark-k8s\n```\n\n### 2. Prepare Local Directories\n\nCreate the directories used by Airflow if they don't already exist and set\npermissions so they are writable by the containers:\n\n```bash\nmkdir -p dags scripts logs plugins\nchmod -R 755 dags scripts logs plugins\n```\n\n### 3. Configure Host Paths\n\nUpdate the paths in two files to match your local environment:\n\n**In `values.yaml`:**\n\n```yaml\nvolumes:\n  hostPaths:\n    dags:    /workspace/dags\n    scripts: /workspace/scripts\n    logs:    /workspace/logs\n    plugins: /workspace/plugins\n```\n\n**In `kind-config.yaml`:**\n\n```yaml\nnodes:\n  - role: control-plane\n    extraMounts:\n      - hostPath: /Users/YOUR_USERNAME/airflow-pyspark-k8s\n        containerPath: /workspace\n```\n\n### 4. Set Java Home\n\nIn `templates/airflow-configmap.yaml`, set the appropriate JAVA_HOME:\n\n**For Intel/AMD (x86_64):**\n\n```yaml\ndata:\n  JAVA_HOME: \"/usr/lib/jvm/java-17-openjdk-amd64\"\n```\n\n**For ARM (Apple Silicon):**\n\n```yaml\ndata:\n  JAVA_HOME: \"/usr/lib/jvm/java-17-openjdk-arm64\"\n```\n\n### 5. Create the Kind Cluster\n\n```bash\nkind create cluster --name airflow-cluster --config kind-config.yaml\n```\n\n### 6. Apply RBAC Configuration\n\n\n```bash\nkubectl apply -f k8s/rbac.yaml\n```\n\n### 7. Deploy with Helm\n\n```bash\nhelm upgrade --install airflow-pyspark . \\\n  --create-namespace \\\n  --values values.yaml \\\n  --wait\n```\n\n### 8. Verify Deployment\n\n```bash\n# Check all pods are running\nkubectl get pods \n\n# Check services\nkubectl get svc \n\n# Watch pod status in real-time\nkubectl get pods -w\n```\n\n## Port Configuration\n\n### Default Port Binding (NodePort 30080)\n\nBy default, the Airflow webserver is exposed via NodePort on port 30080:\n\n```yaml\n# In templates/airflow-webserver-service.yaml\nservice:\n  type: NodePort\n  port: 8080\n  nodePort: 30080\n```\n\nAccess URL: `http://localhost:30080`\n\n### Alternative: Port 8080 Binding\n\nTo bind directly to port 8080 on your local machine, you have two options:\n\n#### Option 1: Port Forwarding (Recommended)\n\n```bash\n# Forward local port 8080 to the Airflow webserver\nkubectl port-forward svc/airflow-webserver 8080:8080 \n```\n\nAccess URL: `http://localhost:8080`\n\n#### Option 2: Modify Kind Configuration\n\nAdd port mapping to `kind-config.yaml`:\n\n```yaml\nkind: Cluster\napiVersion: kind.x-k8s.io/v1alpha4\nnodes:\n  - role: control-plane\n    extraPortMappings:\n      - containerPort: 30080\n        hostPort: 8080\n        protocol: TCP\n    extraMounts:\n      - hostPath: /Users/YOUR_USERNAME/airflow-pyspark-k8s\n        containerPath: /workspace\n```\n\nThen recreate the cluster:\n\n```bash\nkind delete cluster --name airflow-cluster\nkind create cluster --name airflow-cluster --config kind-config.yaml\n```\n\n## Accessing Airflow UI\n\n### Default Credentials\n\n- **URL:** `http://localhost:30080` (or `http://localhost:8080` if using port forwarding)\n- **Username:** `admin`\n- **Password:** `admin`\n\n### First Login\n\n1. Navigate to the Airflow UI\n2. Login with default credentials\n3. You should see the DAGs view with sample DAGs\n4. Toggle DAGs on/off using the switch\n\n## DAGs \u0026 Examples\n\n### Included DAGs\n\n| DAG                         | Description                      | Key Features                                |\n| --------------------------- | -------------------------------- | ------------------------------------------- |\n| `hello_world_dag`           | Basic workflow example           | Python \u0026 Bash operators, task dependencies  |\n| `one_task_dag`              | Minimal single task example      | Simple Python operator                      |\n| `spark_wordcount`           | PySpark integration demo         | SparkSubmitOperator, distributed processing |\n| `test_kubernetes_executor`  | Kubernetes executor validation   | Tests dynamic pod creation                  |\n\n### Creating New DAGs\n\n1. Create a new Python file in the `dags/` directory\n2. Define your DAG using Airflow's decorators or context managers\n3. Save the file - Airflow will auto-detect it within 30 seconds\n\nExample DAG structure:\n\n```python\nfrom airflow import DAG\nfrom airflow.operators.python import PythonOperator\nfrom datetime import datetime, timedelta\n\ndefault_args = {\n    'owner': 'data-team',\n    'retries': 1,\n    'retry_delay': timedelta(minutes=5),\n}\n\nwith DAG(\n    'my_new_dag',\n    default_args=default_args,\n    description='My custom DAG',\n    schedule_interval='@daily',\n    start_date=datetime(2024, 1, 1),\n    catchup=False,\n) as dag:\n    \n    def my_task():\n        print(\"Hello from my task!\")\n    \n    task = PythonOperator(\n        task_id='my_task',\n        python_callable=my_task,\n    )\n```\n\n## Development Workflow\n\n### Hot Reloading\n\nChanges to the following directories are reflected immediately:\n\n- `dags/` - New DAGs appear in UI within 30 seconds\n- `scripts/` - Updated scripts used on next task run\n- `plugins/` - Custom operators/hooks available after scheduler restart\n\n### Testing DAGs Locally\n\n```bash\n# Test DAG loading\nkubectl exec -it deployment/airflow-scheduler  -- airflow dags list\n\n# Test specific DAG\nkubectl exec -it deployment/airflow-scheduler  -- airflow dags test \u003cdag_id\u003e \u003cdate\u003e\n\n# Trigger DAG manually\nkubectl exec -it deployment/airflow-scheduler  -- airflow dags trigger \u003cdag_id\u003e\n```\n\n### Viewing Logs\n\n```bash\n# Scheduler logs\nkubectl logs deployment/airflow-scheduler  -f\n\n# Webserver logs\nkubectl logs deployment/airflow-webserver  -f\n\n# Task logs (available in UI or in logs/ directory)\ntail -f logs/dag_id=\u003cdag_id\u003e/run_id=\u003crun_id\u003e/task_id=\u003ctask_id\u003e/attempt=1.log\n```\n\n## Configuration\n\n### Key Configuration Files\n\n| File                                    | Purpose              | Key Settings                                 |\n| --------------------------------------- | -------------------- | -------------------------------------------- |\n| `values.yaml`                           | Helm value overrides | Host paths, resource limits, executor config |\n| `Chart.yaml`                            | Helm chart metadata  | Version, dependencies, app info              |\n| `kind-config.yaml`                      | Kind cluster setup   | Port mappings, volume mounts                 |\n| `templates/airflow-configmap.yaml`      | Airflow environment  | JAVA_HOME, Python paths                      |\n\n### Common Customizations\n\n#### Increase Resources\n\nIn `values.yaml`:\n\n```yaml\nscheduler:\n  resources:\n    requests:\n      memory: \"1Gi\"\n      cpu: \"500m\"\n    limits:\n      memory: \"2Gi\"\n      cpu: \"1000m\"\n```\n\n#### Add Python Dependencies\n\nCreate a custom Dockerfile:\n\n```dockerfile\nFROM apache/airflow:2.7.0\nUSER airflow\nCOPY requirements.txt /\nRUN pip install --no-cache-dir -r /requirements.txt\n```\n\n#### Configure Spark Resources\n\nIn your DAG:\n\n```python\nspark_config = {\n    \"spark.executor.memory\": \"2g\",\n    \"spark.executor.cores\": \"2\",\n    \"spark.executor.instances\": \"3\",\n}\n```\n\n## Troubleshooting\n\n### Common Issues and Solutions\n\n#### Pod Stuck in Pending/CrashLoopBackOff\n\n```bash\n# Describe pod for events\nkubectl describe pod \u003cpod-name\u003e \n\n# Check logs\nkubectl logs \u003cpod-name\u003e  --previous\n\n# Common fixes:\n# - Check resource availability: kubectl top nodes\n# - Verify volume mounts exist\n# - Check RBAC permissions\n```\n\n#### Volume Mount Failures\n\n```bash\n# Verify paths exist on host\nls -la /Users/YOUR_USERNAME/airflow-pyspark-k8s/\n\n# Check Kind node mounts\ndocker exec -it airflow-cluster-control-plane ls -la /workspace/\n\n# Ensure proper permissions\nchmod -R 755 dags/ scripts/ logs/ plugins/\n```\n\n#### Spark Job Failures\n\n```bash\n# Find Spark driver pod\nkubectl get pods  | grep spark-\n\n# Check driver logs\nkubectl logs \u003cspark-driver-pod\u003e \n\n# Common issues:\n# - JAVA_HOME not set correctly\n# - Insufficient memory for executors\n# - PySpark version mismatch\n```\n\n#### Database Connection Issues\n\n```bash\n# Check PostgreSQL pod\nkubectl logs deployment/postgres \n\n# Test connection\nkubectl exec -it deployment/airflow-scheduler -- airflow db check\n```\n\n### Debug Commands Cheatsheet\n\n```bash\n# Get all resources in the namespace\nkubectl get all \n\n# Describe deployments\nkubectl describe deployment \n\n# Execute commands in scheduler\nkubectl exec -it deployment/airflow-scheduler  -- bash\n\n# Check Airflow configuration\nkubectl exec -it deployment/airflow-scheduler  -- airflow config list\n\n# Force restart deployments\nkubectl rollout restart deployment \n```\n\n## Production Considerations\n\n### Security\n\n- [ ] Change default passwords\n- [ ] Enable RBAC in Airflow\n- [ ] Use secrets management (Kubernetes Secrets, HashiCorp Vault)\n- [ ] Configure network policies\n- [ ] Enable TLS/SSL\n\n### Scalability\n\n- [ ] Use external PostgreSQL (RDS, Cloud SQL)\n- [ ] Configure autoscaling for workers\n- [ ] Use distributed storage (S3, GCS) for logs\n- [ ] Implement resource quotas\n\n### Monitoring\n\n- [ ] Deploy Prometheus \u0026 Grafana\n- [ ] Configure Airflow metrics export\n- [ ] Set up log aggregation (ELK, Fluentd)\n- [ ] Create alerting rules\n\n### High Availability\n\n- [ ] Multiple scheduler replicas (Airflow 2.0+)\n- [ ] Database replication\n- [ ] Multi-zone node pools\n- [ ] Backup strategies\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n------\n\n**Ready to orchestrate your data pipelines?** Star this repo and start building!\n\nFor questions and support, please open an [issue](https://github.com/datatweets/airflow-pyspark-k8s/issues) or join our [discussions](https://github.com/datatweets/airflow-pyspark-k8s/discussions).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatatweets%2Fairflow-pyspark-k8s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatatweets%2Fairflow-pyspark-k8s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatatweets%2Fairflow-pyspark-k8s/lists"}