{"id":36954965,"url":"https://github.com/thedbcooper/pipeline-proof-concept","last_synced_at":"2026-01-13T13:01:08.426Z","repository":{"id":328093687,"uuid":"1111955187","full_name":"thedbcooper/pipeline-proof-concept","owner":"thedbcooper","description":null,"archived":false,"fork":false,"pushed_at":"2025-12-11T13:42:47.000Z","size":274,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-11T20:44:28.006Z","etag":null,"topics":["azure-blob-storage","data-engineering","data-pipeline","informatics","pydantic","python","streamlit"],"latest_commit_sha":null,"homepage":"https://public-health-data-agile-pipeline.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thedbcooper.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-07T23:51:58.000Z","updated_at":"2025-12-11T13:42:51.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/thedbcooper/pipeline-proof-concept","commit_stats":null,"previous_names":["thedbcooper/pipeline-proof-concept"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/thedbcooper/pipeline-proof-concept","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thedbcooper%2Fpipeline-proof-concept","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thedbcooper%2Fpipeline-proof-concept/tags","releases_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thedbcooper%2Fpipeline-proof-concept/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thedbcooper%2Fpipeline-proof-concept/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thedbcooper","download_url":"https://codeload.github.com/thedbcooper/pipeline-proof-concept/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thedbcooper%2Fpipeline-proof-concept/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28385800,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-13T12:01:30.995Z","status":"ssl_error","status_checked_at":"2026-01-13T12:00:09.625Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure-blob-storage","data-engineering","data-pipeline","informatics","pydantic","python","streamlit"],"created_at":"2026-01-13T13:00:49.701Z","updated_at":"2026-01-13T13:01:08.420Z","avatar_url":"https://github.com/thedbcooper.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧬 Public Health Data Pipeline (Lightweight Data 
Lakehouse)\n\n[![Python](https://img.shields.io/badge/Python-3.11-blue?logo=python\u0026logoColor=white)](https://www.python.org)\n[![Azure](https://img.shields.io/badge/Cloud-Azure%20Blob-0078D4?logo=microsoft-azure\u0026logoColor=white)](https://azure.microsoft.com)\n[![Streamlit](https://img.shields.io/badge/UI-Streamlit-FF4B4B?logo=streamlit\u0026logoColor=white)](https://streamlit.io)\n[![Polars](https://img.shields.io/badge/Data-Polars-CD792C?logo=polars)](https://pola.rs)\n[![Pydantic](https://img.shields.io/badge/Validation-Pydantic-E92063?logo=pydantic\u0026logoColor=white)](https://docs.pydantic.dev)\n[![GitHub Actions](https://img.shields.io/badge/CI%2FCD-GitHub%20Actions-2088FF?logo=github-actions\u0026logoColor=white)](https://github.com/features/actions)\n\nAn automated, serverless data pipeline designed to ingest, validate, and aggregate data. This project implements a **\"Human-in-the-Loop\"** architecture where invalid data is automatically quarantined, fixed via a UI, and re-injected into the pipeline without code changes.\n\n### 🎮 **[Live Portfolio Demo](https://public-health-data-pipeline.streamlit.app/)**\n\n\u003e **Note to Viewer:** The \"Real App\" runs locally and is not deployed. 
This live demo uses **Interface Abstraction** to simulate Azure Blob Storage in memory, ensuring no connection to real cloud infrastructure.\n\n-----\n\n## 🏗️ Architecture\n\n### Pipeline Workflow\n\n```mermaid\nflowchart TB\n    subgraph USER[\"👤 User Interface (Streamlit)\"]\n        UPLOAD[\"📤 Upload CSV\"]\n        FIX[\"🛠️ Fix Quarantine\"]\n        DELETE_UPLOAD[\"🗑️ Upload Deletion Request\"]\n        TRIGGER_PIPE[\"▶️ Trigger Pipeline\"]\n        TRIGGER_DEL[\"▶️ Trigger Deletion\"]\n        MONITOR[\"📊 Auto-Monitor\u003cbr/\u003e(Fragments @ 15s)\"]\n    end\n\n    subgraph GITHUB[\"🐙 GitHub Actions\"]\n        DISPATCH_PIPE[\"workflow_dispatch\"]\n        DISPATCH_DEL[\"workflow_dispatch\"]\n        WEEKLY[\"weekly_pipeline.yaml\u003cbr/\u003e(Cron: Weekly)\"]\n        DELWF[\"delete_records.yaml\u003cbr/\u003e(Manual Only)\"]\n    end\n\n    subgraph PIPELINE[\"⚙️ ETL Pipeline\"]\n        VALIDATE[\"Pydantic Validation\"]\n        ROUTE{{\"Route Data\"}}\n        UPSERT[\"Upsert Valid Records\"]\n        QUARANTINE_OP[\"Quarantine Invalid\"]\n        DELETE_OP[\"Remove Records\"]\n    end\n\n    subgraph AZURE[\"☁️ Azure Blob Storage\"]\n        LANDING[(\"📂 landing-zone\u003cbr/\u003e(Raw CSVs)\")]\n        QUAR[(\"🚨 quarantine\u003cbr/\u003e(Failed Validation)\")]\n        DATA[(\"📊 data\u003cbr/\u003e(Partitioned Parquet)\")]\n        LOGS[(\"📋 logs\u003cbr/\u003e(Execution History)\")]\n        DELREQ[(\"🗑️ deletion-requests\u003cbr/\u003e(Pending Deletions)\")]\n    end\n\n    subgraph OUTPUT[\"📈 Outputs\"]\n        REPORT[\"final_cdc_export.csv\"]\n        AUDIT[\"Audit Trail\"]\n    end\n\n    %% User actions\n    UPLOAD --\u003e LANDING\n    FIX --\u003e LANDING\n    DELETE_UPLOAD --\u003e DELREQ\n    TRIGGER_PIPE --\u003e DISPATCH_PIPE\n    TRIGGER_DEL --\u003e DISPATCH_DEL\n\n    %% GitHub triggers\n    DISPATCH_PIPE --\u003e WEEKLY\n    DISPATCH_DEL --\u003e DELWF\n    \n    %% Pipeline flows\n    WEEKLY --\u003e VALIDATE\n    LANDING --\u003e 
VALIDATE\n    VALIDATE --\u003e ROUTE\n    ROUTE --\u003e|\"✅ Valid\"| UPSERT\n    ROUTE --\u003e|\"❌ Invalid\"| QUARANTINE_OP\n    UPSERT --\u003e DATA\n    QUARANTINE_OP --\u003e QUAR\n    UPSERT --\u003e LOGS\n    \n    %% Deletion flow\n    DELWF --\u003e DELETE_OP\n    DELREQ --\u003e DELETE_OP\n    DELETE_OP --\u003e DATA\n    DELETE_OP --\u003e LOGS\n    \n    %% Outputs\n    DATA --\u003e REPORT\n    LOGS --\u003e AUDIT\n\n    %% Monitoring loop\n    WEEKLY -.-\u003e|\"Status API\"| MONITOR\n    DELWF -.-\u003e|\"Status API\"| MONITOR\n    MONITOR -.-\u003e|\"Poll\"| GITHUB\n\n    %% Quarantine loop\n    QUAR -.-\u003e|\"Human Review\"| FIX\n\n    %% Styling\n    classDef userStyle fill:#e1f5fe,stroke:#01579b\n    classDef githubStyle fill:#f3e5f5,stroke:#4a148c\n    classDef pipelineStyle fill:#fff3e0,stroke:#e65100\n    classDef azureStyle fill:#e3f2fd,stroke:#0d47a1\n    classDef outputStyle fill:#e8f5e9,stroke:#1b5e20\n\n    class UPLOAD,FIX,DELETE_UPLOAD,TRIGGER_PIPE,TRIGGER_DEL,MONITOR userStyle\n    class DISPATCH_PIPE,DISPATCH_DEL,WEEKLY,DELWF githubStyle\n    class VALIDATE,ROUTE,UPSERT,QUARANTINE_OP,DELETE_OP pipelineStyle\n    class LANDING,QUAR,DATA,LOGS,DELREQ azureStyle\n    class REPORT,AUDIT outputStyle\n```\n\n### Core Data Flow\n\n1.  **Landing Zone:** User uploads raw CSVs via Streamlit Admin Console → `landing-zone` container.\n2.  **Automated Processing (GitHub Actions):** `weekly_pipeline.yaml` triggers on schedule or manual dispatch.\n      * **Validation:** `Pydantic` enforces strict schema (sample_id, test_date, result, viral_load).\n      * **Routing:** Valid data → `data` container (partitioned Parquet by week). Invalid data → `quarantine` container (CSV).\n      * **Logging:** Emoji-rich processing logs with detailed metrics saved to `logs` container as `execution_TIMESTAMP.csv`.\n3.  **Quarantine Resolution:** Admins review errors in the UI, fix the data (e.g., \"Positive\" → \"POS\"), and re-upload it to `landing-zone` for automatic reprocessing.\n4.  
**Deletion Workflow:** Two-step process for permanent record removal:\n      * Upload deletion request CSV (sample_id + test_date) → `deletion-requests` container\n      * Trigger `delete_records.yaml` GitHub Action → removes from partitioned data\n      * Logs deleted sample IDs to `logs` container as `deletion_TIMESTAMP.csv`\n5.  **Reporting:** Aggregated clean data exported to `final_cdc_export.csv` with complete audit trail.\n\n### Storage Containers\n\n- **landing-zone**: Raw CSV uploads from partners\n- **quarantine**: Invalid records awaiting manual review\n- **data**: Validated records in partitioned Parquet format (year=YYYY/week=WW/)\n- **logs**: Processing and deletion execution logs (CSV with processing_details)\n- **deletion-requests**: Pending deletion requests (CSV with sample_id and test_date)\n\n-----\n\n## 📂 Repository Structure\n\n```text\n.\n├── .github/workflows/\n│   ├── weekly_pipeline.yaml      # Scheduled pipeline automation (production)\n│   └── delete_records.yaml       # Manual deletion workflow trigger\n├── admin_tools/\n│   ├── demo_app.py               # 🎮 THE DEMO APP (Public Portfolio Frontend)\n│   ├── web_uploader.py           # 🔒 THE REAL APP (Local Production Admin Console)\n│   ├── mock_azure.py             # Cloud Emulation Logic for Demo\n│   ├── fetch_errors.py           # Utility: Download quarantine files\n│   ├── reingest_fixed_data.py    # Utility: Re-upload fixed data\n│   ├── generate_and_upload_mock_data.py  # Utility: Generate test data\n│   └── test_connection.py        # Utility: Test Azure connection\n├── pipeline/\n│   ├── process_data_cloud.py     # Core ETL Logic (Polars + Pydantic)\n│   ├── export_report.py          # Generate final CDC aggregate report\n│   └── delete_records.py         # Process deletion requests from CSV\n├── models.py                     # Pydantic Schema Definitions\n├── pyproject.toml                # Project dependencies (uv)\n└── README.md\n```\n\n-----\n\n## 🌟 Key Features\n\n### 1\\. 
\"Self-Healing\" Data Quality\n\nMost pipelines crash on bad data. This one **side-steps** it.\n\n  * **Polars + Pydantic:** Used for fast validation and data integrity.\n  * **Quarantine:** Invalid files move to `quarantine/` and wait for human review.\n  * **Human-in-the-Loop:** The Admin Console provides an Excel-like editor to fix the typo and retry.\n  * **Detailed Logging:** Every pipeline run saves comprehensive logs with emoji-rich processing details for easy debugging.\n\n### 2\\. Data Deletion Workflow\n\nA dedicated workflow for data corrections:\n\n  * **Two-Step Process:** Upload deletion requests (CSV with sample_id and test_date), then trigger workflow.\n  * **Partition-Aware:** Automatically calculates which partitions to check based on test dates.\n  * **Audit Trail:** Logs which sample IDs were deleted from which partitions with full timestamp tracking.\n  * **GitHub Actions Integration:** Secure, authenticated deletion via automated workflow.\n\n### 3\\. Real-Time Workflow Monitoring with Streamlit Fragments\n\nThe Admin Console provides **live status updates** without full page reloads:\n\n  * **`@st.fragment(run_every=\"15s\")`:** Uses Streamlit's fragment feature to auto-poll GitHub Actions API every 15 seconds while a workflow is running.\n  * **Smart Workflow Detection:** Compares UTC timestamps to distinguish between old runs and newly-triggered workflows, preventing false \"success\" messages from stale data.\n  * **Session State Persistence:** Stores workflow results in `st.session_state` so status messages persist across page interactions.\n  * **Auto-Stop Monitoring:** Automatically stops polling when workflow completes (success, failure, or cancelled) and displays final status.\n  * **Seamless UX:** Users can trigger a pipeline, navigate to other tabs, and return to see live progress or final results.\n\n### 4\\. 
Cloud Abstraction (Security Highlight)\n\nTo share this project publicly without exposing Azure credentials, I implemented a **Mock Object Pattern**:\n\n  * **Interface Abstraction:** The mock clients in `mock_azure.py` mirror the Azure SDK methods the app uses (`upload_blob`, `download_blob`, `list_blobs`).\n  * **Safety:** The Demo App injects these mock clients instead of real Azure clients, so the full UI workflow runs entirely in memory with no connection to real cloud infrastructure.\n\n-----\n\n## 🖥️ Admin Console Pages\n\nThe Streamlit Admin Console provides a complete self-service interface:\n\n| Page | Purpose |\n| :--- | :--- |\n| **🏠 Start Here** | Landing page with workflow diagrams and navigation guidance |\n| **📤 Upload New Data** | Drag-and-drop CSV upload to landing zone with file preview |\n| **🛠️ Fix Quarantine** | Excel-like data editor to review and correct validation errors |\n| **🗑️ Delete Records** | Upload deletion requests and trigger deletion workflow |\n| **⚙️ Data Ingestion** | Trigger pipeline, monitor progress, view execution history |\n| **📊 Final Report** | View/download the aggregated CDC export with all valid records |\n| **ℹ️ About** | Project context, technical implementation, and author info |\n\n-----\n\n## 🚀 How to Run\n\n### Option A: The Portfolio Demo (Browser)\n\nSimply visit the **[Live App](https://public-health-data-pipeline.streamlit.app/)**. No setup required.\n\n### Option B: The Real Production App (Local)\n\n*Note: Requires active Azure Credentials in `.env`*\n\n1.  **Install dependencies:**\n    ```bash\n    uv sync\n    ```\n2.  **Run the Admin Console:**\n    ```bash\n    uv run streamlit run admin_tools/web_uploader.py\n    ```\n3.  
**Trigger the Pipeline:**\n    Click the \"Trigger Weekly Pipeline\" button in the sidebar (requires GitHub Token) to process the files you upload.\n\n-----\n\n## 🚀 Deployment Prerequisites (Required for Real App \u0026 Pipeline)\n\nTo run the full production pipeline, you must establish a secure connection between **Azure Storage** and **GitHub Actions**.\n\n### 1\\. Azure Storage Setup 🟦\n\nCreate a Storage Account and the following private containers, which serve as the structure for your Data Lake:\n\n| Container Name | Purpose |\n| :--- | :--- |\n| **landing-zone** | Receives raw uploaded CSVs from the Admin Console. |\n| **quarantine** | Holds CSV files that failed Pydantic validation (for human review). |\n| **data** | Stores the finalized, cleaned data in partitioned Parquet files. |\n| **logs** | Stores execution and deletion logs as CSV files for the audit trail. |\n| **deletion-requests** | Holds pending deletion request CSVs before processing. |\n\n### 2\\. Generating Secrets (Service Principal) 🔑\n\nThe GitHub Action needs a **Service Principal (SP)** to act as the \"robot\" with specific access rights. This gives CI/CD a dedicated identity whose access is limited to the role and scope you assign, instead of reusing personal credentials.\n\nRun the Azure CLI command below to generate the necessary credentials:\n\n```bash\naz ad sp create-for-rbac \\\n--name \"GitHubPipelineRobot\" \\\n--role \"Storage Blob Data Contributor\" \\\n--scopes /subscriptions/YOUR_SUBSCRIPTION_ID/resourceGroups/YOUR_RESOURCE_GROUP\n```\n\nThe command's output includes three values that you must save:\n\n  * `appId` (This is your **AZURE\\_CLIENT\\_ID**)\n  * `password` (This is your **AZURE\\_CLIENT\\_SECRET**)\n  * `tenant` (This is your **AZURE\\_TENANT\\_ID**)\n\n### 3\\. Adding Secrets to GitHub Actions 🐙\n\nNavigate to your repository settings on GitHub (`Settings` → `Secrets and variables` → `Actions`). 
Add the following repository secrets based on the values you generated above:\n\n| Secret Name | Source / Value | Used By |\n| :--- | :--- | :--- |\n| `AZURE_CLIENT_ID` | The `appId` value from the CLI. | GitHub Actions workflows |\n| `AZURE_CLIENT_SECRET` | The `password` value from the CLI. | GitHub Actions workflows |\n| `AZURE_TENANT_ID` | The `tenant` value from the CLI. | GitHub Actions workflows |\n| `AZURE_STORAGE_ACCOUNT` | Your storage account name, lowercase letters and numbers only (e.g., `phdata01`). | GitHub Actions workflows |\n\n### 4\\. Configuring Local Environment (.env) 🔧\n\nFor the Streamlit apps to authenticate with Azure and trigger/monitor GitHub Actions, create a `.env` file in the project root with these variables:\n\n```env\n# Azure Authentication (same values as GitHub secrets)\nAZURE_CLIENT_ID=\u003cyour_appId\u003e\nAZURE_CLIENT_SECRET=\u003cyour_password\u003e\nAZURE_TENANT_ID=\u003cyour_tenant\u003e\nAZURE_STORAGE_ACCOUNT=\u003cyour_storage_account_name\u003e\n\n# GitHub API Access (for triggering workflows from Streamlit)\nGITHUB_TOKEN=\u003cyour_personal_access_token\u003e\nREPO_OWNER=\u003cyour_github_username\u003e\nREPO_NAME=\u003cyour_repository_name\u003e\n```\n\n**Note:** The `GITHUB_TOKEN` must be a Personal Access Token (PAT) that is allowed to trigger Actions and read run status. For a fine-grained PAT, grant the **Actions** repository permission with read and write access.\n\n| Secret Name     | Source / Value                                         | Used By                |\n| :-------------- | :----------------------------------------------------- | :--------------------- |\n| `GITHUB_TOKEN`  | A fine-grained PAT with the **Actions: Read and write** repository permission. | Workflow Trigger Buttons |\n| `REPO_OWNER`    | Your GitHub username (e.g., `thedbcooper`).            | Workflow Trigger Buttons |\n| `REPO_NAME`     | Your repository name (e.g., `pipeline-proof-concept`). 
| Workflow Trigger Buttons |\n\n-----\n\n### 👨‍💻 Created by Daniel Cooper\n\n[![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-0077B5?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/in/danielblakecooper/)\n[![GitHub](https://img.shields.io/badge/GitHub-Follow-181717?style=for-the-badge\u0026logo=github\u0026logoColor=white)](https://github.com/thedbcooper)\n[![ORCID](https://img.shields.io/badge/ORCID-0000--0002--2218--7916-A6CE39?style=for-the-badge\u0026logo=orcid\u0026logoColor=white)](https://orcid.org/0000-0002-2218-7916)\n\n*Epidemiologist \u0026 Analytics Engineer*","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthedbcooper%2Fpipeline-proof-concept","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthedbcooper%2Fpipeline-proof-concept","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthedbcooper%2Fpipeline-proof-concept/lists"}