{"id":51043695,"url":"https://github.com/hackyourfuture/data-assignment-week-4","last_synced_at":"2026-06-22T12:02:10.715Z","repository":{"id":360539981,"uuid":"1223590297","full_name":"HackYourFuture/data-assignment-week-4","owner":"HackYourFuture","description":"HackYourFuture data track week 4 assignment files","archived":false,"fork":false,"pushed_at":"2026-05-26T20:24:07.000Z","size":19,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T22:11:49.393Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HackYourFuture.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T13:18:30.000Z","updated_at":"2026-05-26T20:24:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/HackYourFuture/data-assignment-week-4","commit_stats":null,"previous_names":["hackyourfuture/data-assignment-week-4"],"tags_count":null,"template":true,"template_full_name":"HackYourFuture/assignment-template","purl":"pkg:github/HackYourFuture/data-assignment-week-4","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HackYourFuture%2Fdata-assignment-week-4","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HackYourFuture%2Fdata-assignment-week-4/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/HackYourFuture%2Fdata-assignment-week-4/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HackYourFuture%2Fdata-assignment-week-4/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HackYourFuture","download_url":"https://codeload.github.com/HackYourFuture/data-assignment-week-4/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HackYourFuture%2Fdata-assignment-week-4/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34647750,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-22T12:02:09.952Z","updated_at":"2026-06-22T12:02:10.710Z","avatar_url":"https://github.com/HackYourFuture.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Week 4 Assignment: MessyCorp Pandas\n\n**Clean and report on messy sales data** · Total: 100 points · Passing: 60\n\nRead the full assignment on the HYF Data Track: [Assignment: MessyCorp Pandas](https://hub.hackyourfuture.nl/)\n\n---\n\n## Where to start\n\nWork through the files in this order:\n\n| Step | File | Tasks |\n|---|---|---|\n| 1 | `src/ingest.py` | Task 1: download inputs from Azure |\n| 2 | `src/clean.py` | Task 2: 
explore + Task 3: clean sales |\n| 3 | `src/transform.py` | Task 4: join customers, add `is_high_value` |\n| 4 | `src/report.py` | Task 5: build report tables + Task 6: write outputs |\n| 5 | `src/ingest.py` | Task 7 *(extra credit)*: upload results to Azure |\n| 6 | `main.py` | Set `GITHUB_USERNAME`, then run the full pipeline |\n| 7 | `AI_ASSIST.md` | Task 8: fill in before submitting |\n\nOpen each file and read the docstrings and TODO comments — they explain exactly what to implement.\n\n---\n\n## Repository layout\n\n```text\n.\n├── sample_data/\n│   ├── messy_sales.csv      # fallback if Azure is unavailable — copy to data/ manually\n│   └── messy_customers.csv\n├── src/\n│   ├── ingest.py       # Tasks 1 + 7 — Azure download and upload\n│   ├── clean.py        # Tasks 2 + 3 — explore and clean sales data\n│   ├── transform.py    # Task 4     — join customers, add is_high_value\n│   └── report.py       # Tasks 5 + 6 — build tables and write outputs\n├── main.py             # Pipeline runner — set GITHUB_USERNAME for Task 7\n├── AI_ASSIST.md        # Task 8 — fill in before submitting\n├── .gitignore          # data/ and output/ are excluded — generated at runtime\n└── .hyf/\n    └── test.sh         # auto-grader — read this to see exactly what is checked\n```\n\nFiles the pipeline generates at runtime (gitignored):\n- `data/` — raw CSVs downloaded from Azure in Task 1\n- `output/` — report CSVs, Parquet, and chart written in Task 6\n\n---\n\n## Setup\n\n```bash\npip install pandas azure-identity azure-storage-blob matplotlib pyarrow\n```\n\nLog in to Azure (reuses your Week 2 session):\n\n```bash\naz login\n```\n\n\u003e **If Azure is unavailable** (login issues, no network): copy the files from `sample_data/` into a `data/` folder at the repo root, then comment out the `download_inputs(DATA_DIR)` call in `main.py`. 
You can complete Tasks 2–6 without Azure access and return to Tasks 1 and 7 once your session is working.\n\n---\n\n## Run the pipeline\n\nEdit `GITHUB_USERNAME` in `main.py` before running Task 7, then:\n\n```bash\npython main.py\n```\n\n---\n\n## Check your score locally\n\nRun the same script the auto-grader runs on every PR push:\n\n```bash\nbash .hyf/test.sh\ncat .hyf/score.json\n```\n\n---\n\n## Scoring ladder\n\nTasks 2–6 are the core of this assignment and are enough to pass. Task 7 and the code quality checks are extra credit.\n\n| Score | What the grader checks |\n|---|---|\n| 14 | Stubs committed: all five function names present, Azure imports, `data/` in `.gitignore` |\n| ~24 | Task 2: `.info()`, `.describe()`, `.isna().sum()`, `.head()` all called |\n| ~44 | Task 3: vectorised string cleaning, `pd.to_numeric`, `pd.to_datetime`, row filters, `drop_duplicates` on `transaction_id` |\n| ~59 | Task 4: email normalisation, `how=\"inner\"` merge, vectorised `is_high_value` (no loops) |\n| ~79 | Task 5: named aggregations (`total_revenue=`, `order_count=`), `isocalendar().week`, `(\"customer_name\", \"first\")` |\n| ~89 | Task 6: all three output files written with `index=False`, chart saved with `savefig` |\n| ~94 | *(extra credit)* Task 7: `upload_outputs` uses `assert` + `len()` to verify the Azure round-trip |\n| 100 | *(extra credit)* Code quality: `Path(...)` constructor and `logging.info/warning/error` calls used in `src/` |\n\n---\n\n## Submitting\n\n1. Create a branch `week4/your-name`.\n2. Commit your work.\n3. Push and open a Pull Request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackyourfuture%2Fdata-assignment-week-4","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackyourfuture%2Fdata-assignment-week-4","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackyourfuture%2Fdata-assignment-week-4/lists"}