{"id":49874364,"url":"https://github.com/mtholahan/apache-airflow-mini-project","last_synced_at":"2026-05-15T11:41:41.478Z","repository":{"id":310724692,"uuid":"1040494778","full_name":"mtholahan/apache-airflow-mini-project","owner":"mtholahan","description":"Built Apache Airflow DAGs to automate Yahoo Finance stock data ingestion, storage, and querying, then extended with a Python log analyzer to monitor execution errors. Demonstrates orchestration, scheduling, operator use, and pipeline monitoring.","archived":false,"fork":false,"pushed_at":"2025-09-15T04:57:54.000Z","size":126,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-15T05:41:16.604Z","etag":null,"topics":["airflow","bootcamp","dag","data-engineering","data-pipeline","etl","logging","monitoring","python","springboard"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mtholahan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-19T04:25:10.000Z","updated_at":"2025-09-15T04:57:57.000Z","dependencies_parsed_at":"2025-09-15T05:41:22.750Z","dependency_job_id":"3fdd9bc5-1bc5-42f1-ba73-bdd52c6e88fd","html_url":"https://github.com/mtholahan/apache-airflow-mini-project","commit_stats":null,"previous_names":["mtholahan/airflow_mini_project","mtholahan/airflow-mini-project-01","mtholahan/airflow-mini-project","mtholahan/apache-airflow-mini-project"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/mtholahan/apache-airflow-mini-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fapache-airflow-mini-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fapache-airflow-mini-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fapache-airflow-mini-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fapache-airflow-mini-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mtholahan","download_url":"https://codeload.github.com/mtholahan/apache-airflow-mini-project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fapache-airflow-mini-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33066050,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T11:35:32.926Z","status":"ssl_error","status_checked_at":"2026-05-15T11:35:31.362Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","bootcamp","dag","data-engineering","data-pipeline","etl","logging","monitoring","python","springboard"],"created_at":"2026-05-15T11:41:37.240Z","updated_at":"2026-05-15T11:41:41.473Z","avatar_url":"https://github.com/mtholahan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Airflow DAG Mini Project\r\n\r\n\r\n## 📖 Abstract\r\nThis mini-project showcases orchestration and monitoring of a data pipeline using Apache Airflow. The pipeline fetches 1-minute intraday stock data for AAPL and TSLA via the Yahoo Finance API, scheduled to run at daily market close. The workflow DAG includes tasks for data extraction, CSV persistence, HDFS-style directory organization, and a final downstream query.\r\r\nTo enhance observability, a companion Python-based log analyzer parses Airflow scheduler logs. It uses `pathlib` and text processing to identify failures, warnings, and execution status across multiple DAG runs. The output includes aggregated error counts and detailed diagnostics to aid in root cause analysis.\r\r\nThe project reinforces key Airflow concepts such as DAG authoring, dependency management, Bash/Python operators, CeleryExecutor parallelism, and operational monitoring.\r\r\nAs part of the infrastructure, I authored `start-airflow.sh`, a WSL2-compatible Docker startup script that automates Airflow init, user creation, and service startup with race-condition safeguards and reset/debug modes.\r\n\r\n\r\n\r\n## 🛠 Requirements\r\n- Docker Engine v20+ and Docker Compose v2\r\r\n- Ubuntu 22.04 LTS (tested) or compatible WSL2 environment\r\r\n- docker-compose.yaml defining:\r\r\n\t- airflow-webserver (UI on http://localhost:8080)\r\r\n\t- airflow-scheduler\r\r\n\t- airflow-worker\r\r\n\t- postgres (metadata DB)\r\r\n\t- redis (Celery broker)\r\r\n- Python dependencies (in requirements.txt):\r\r\n\t- yfinance\r\r\n\t- pandas\r\r\n\t- requests\r\r\n\t- apache-airflow-providers-postgres\r\r\n\t- apache-airflow-providers-redis\r\r\n\t- apache-airflow-providers-http\r\r\n\t- pytz\r\n\r\n\r\n\r\n## 🧰 Setup\r\n- Run bootstrap script:\r\r\n\t./start-airflow.sh --init    # Initializes Airflow metadata DB, creates user, starts services\r\r\n\t./start-airflow.sh --reset   # (Optional) Reset Airflow environment\r\r\n\t./start-airflow.sh --debug   # (Optional) Debug startup sequence\r\r\n\r\r\n- If not using script, manual steps:\r\r\n\t- docker-compose build --no-cache\r\r\n\t- docker-compose run airflow-webserver airflow db init\r\r\n\t- Create Airflow user...\r\r\n\t- docker-compose up -d\r\r\n\r\r\n- Access Airflow UI at http://localhost:8080\r\r\n- Verify DAGs load from ./dags and logs from ./logs\r\n\r\n\r\n\r\n## 📊 Dataset\r\n- Yahoo Finance API data for AAPL and TSLA with 1-minute intervals.\r\r\nSchema includes: date_time, open, high, low, close, adj_close, volume.\r\n\r\n\r\n\r\n## ⏱️ Run Steps\r\n- Start Docker services: docker-compose up -d\r\r\n- Access the Airflow UI: http://localhost:8080\r\r\n- Confirm DAG \"marketvol\" appears and is scheduled for weekdays at 6 PM\r\r\n- Task breakdown:\r\r\n\t- t0: Initialize working directory (BashOperator)\r\r\n\t- t1, t2: Fetch AAPL and TSLA data (PythonOperator)\r\r\n\t- t3, t4: Move CSVs to target location (BashOperator)\r\r\n\t- t5: Execute custom query on combined data (PythonOperator)\r\r\n- Monitor run via Airflow UI\r\r\n- Run log_analyzer.py to extract error summaries and debug insights\r\n\r\n\r\n\r\n## 📈 Outputs\r\n- CSVs of intraday stock data for AAPL and TSLA\r\r\n- Query results on combined dataset\r\r\n- Airflow execution logs\r\r\n- Log analyzer output: total error count and detailed error messages\r\n\r\n\r\n\r\n## 📸 Evidence\r\n\r\n![01_dockerized_airflow_in_operation.png](./evidence/01_dockerized_airflow_in_operation.png)  \r\nScreenshot of Dockerized Airflow\r\n\r\n![02_Airflow_UI.png](./evidence/02_Airflow_UI.png)  \r\nScreenshot of Airflow UI\r\n\r\n\r\n\r\n\r\n## 📎 Deliverables\r\n\r\n- [`marketvol_dag.py`](./deliverables/marketvol_dag.py)\r\n\r\n- [`log_analyzer_dag.py`](./deliverables/log_analyzer_dag.py)\r\n\r\n- [`log_analyzer.py`](./deliverables/log_analyzer.py)\r\n\r\n- [`marketvol_combined_log_2025-09-19_00-12-58.txt`](./deliverables/marketvol_combined_log_2025-09-19_00-12-58.txt)\r\n\r\n- [`marketvol_combined_log_2025-09-20_21-07-52.txt`](./deliverables/marketvol_combined_log_2025-09-20_21-07-52.txt)\r\n\r\n- [`report_download_aapl_2025-09-20_03-46-26.txt`](./deliverables/report_download_aapl_2025-09-20_03-46-26.txt)\r\n\r\n- [`report_download_tsla_2025-09-20_03-46-29.txt`](./deliverables/report_download_tsla_2025-09-20_03-46-29.txt)\r\n\r\n- [`start-airflow.sh`](./deliverables/start-airflow.sh)\r\n\r\n- [`docker-compose.yaml`](./deliverables/docker-compose.yaml)\r\n\r\n\r\n\r\n\r\n## 🛠️ Architecture\r\n- Directed Acyclic Graph (DAG) with task parallelism\r\r\n- CeleryExecutor for distributed task execution\r\r\n- Python-based log analyzer integrated for observability\r\r\n- Docker-based deployment using custom shell script\r\n\r\n\r\n\r\n## 🔍 Monitoring\r\n- Airflow UI for DAG run monitoring\r\r\n- Python log analyzer for automated error detection and reporting\r\n\r\n\r\n\r\n## ♻️ Cleanup\r\n- Remove temp data directories under /tmp/data\r\r\n- Optionally drop DAG definition from Airflow once complete\r\n\r\n\r\n*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 11-11-2025 15:30:52*","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtholahan%2Fapache-airflow-mini-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmtholahan%2Fapache-airflow-mini-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtholahan%2Fapache-airflow-mini-project/lists"}