https://github.com/hackyourfuture/data-assignment-week-4

HackYourFuture data track week 4 assignment files

Host: GitHub
URL: https://github.com/hackyourfuture/data-assignment-week-4
Owner: HackYourFuture
Created: 2026-04-28T13:18:30.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-26T20:24:07.000Z (about 1 month ago)
Last Synced: 2026-05-26T22:11:49.393Z (about 1 month ago)
Language: Python
Size: 18.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Week 4 Assignment: MessyCorp Pandas

**Clean and report on messy sales data** · Total: 100 points · Passing: 60

Read the full assignment on the HYF Data Track: [Assignment: MessyCorp Pandas](https://hub.hackyourfuture.nl/)

---

## Where to start

Work through the files in this order:

| Step | File | Tasks |
|---|---|---|
| 1 | `src/ingest.py` | Task 1: download inputs from Azure |
| 2 | `src/clean.py` | Task 2: explore + Task 3: clean sales |
| 3 | `src/transform.py` | Task 4: join customers, add `is_high_value` |
| 4 | `src/report.py` | Task 5: build report tables + Task 6: write outputs |
| 5 | `src/ingest.py` | Task 7 *(extra credit)*: upload results to Azure |
| 6 | `main.py` | Set `GITHUB_USERNAME`, then run the full pipeline |
| 7 | `AI_ASSIST.md` | Task 8: fill in before submitting |

Open each file and read the docstrings and TODO comments — they explain exactly what to implement.

---

## Repository layout

```text
.
├── sample_data/
│   ├── messy_sales.csv      # fallback if Azure is unavailable; copy to data/ manually
│   └── messy_customers.csv
├── src/
│   ├── ingest.py       # Tasks 1 + 7: Azure download and upload
│   ├── clean.py        # Tasks 2 + 3: explore and clean sales data
│   ├── transform.py    # Task 4: join customers, add is_high_value
│   └── report.py       # Tasks 5 + 6: build tables and write outputs
├── main.py             # Pipeline runner; set GITHUB_USERNAME for Task 7
├── AI_ASSIST.md        # Task 8: fill in before submitting
├── .gitignore          # data/ and output/ are excluded; generated at runtime
└── .hyf/
    └── test.sh         # auto-grader; read this to see exactly what is checked
```

Files the pipeline generates at runtime (gitignored):

- `data/` — raw CSVs downloaded from Azure in Task 1

- `output/` — report CSVs, Parquet, and chart written in Task 6
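
Since both folders are created at runtime, a small helper can make sure they exist before the pipeline reads or writes anything. The following is only a minimal sketch: the function name `ensure_runtime_dirs` and its `root` parameter are hypothetical and not part of the assignment stubs. It also happens to use the `Path(...)` constructor and `logging.info` calls that the code quality check looks for.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)


def ensure_runtime_dirs(root="."):
    """Create data/ and output/ under root if missing and return both paths.

    Hypothetical helper: the name and signature are assumptions,
    not taken from the assignment stubs.
    """
    data_dir = Path(root) / "data"
    output_dir = Path(root) / "output"
    for directory in (data_dir, output_dir):
        if directory.exists():
            logging.info("Directory already exists: %s", directory)
        else:
            directory.mkdir(parents=True)
            logging.info("Created directory: %s", directory)
    return data_dir, output_dir
```

Calling `ensure_runtime_dirs()` from the repo root is safe to repeat, because existing folders are left untouched.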

---

## Setup

```bash
pip install pandas azure-identity azure-storage-blob matplotlib pyarrow
```

Log in to Azure (reuses your Week 2 session):

```bash
az login
```

> **If Azure is unavailable** (login issues, no network): copy the files from `sample_data/` into a `data/` folder at the repo root, then comment out the `download_inputs(DATA_DIR)` call in `main.py`. You can complete Tasks 2–6 without Azure access and return to Tasks 1 and 7 once your session is working.
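
If you prefer to script the fallback copy instead of doing it by hand, something like the sketch below should work. It assumes the layout described in this README (`sample_data/` and `data/` at the repo root); the function name `copy_sample_data` is hypothetical and not part of the assignment code.

```python
import shutil
from pathlib import Path


def copy_sample_data(repo_root="."):
    """Copy every CSV from sample_data/ into data/ and return the file names.

    Hypothetical helper for the offline fallback; not part of the stubs.
    """
    root = Path(repo_root)
    src = root / "sample_data"
    dst = root / "data"
    dst.mkdir(exist_ok=True)  # data/ is gitignored, so it may not exist yet
    copied = []
    for csv_path in sorted(src.glob("*.csv")):
        shutil.copy(csv_path, dst / csv_path.name)
        copied.append(csv_path.name)
    return copied
```

You still need to comment out the `download_inputs(DATA_DIR)` call in `main.py`, otherwise the pipeline will try to reach Azure.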

---

## Run the pipeline

Edit `GITHUB_USERNAME` in `main.py` before running Task 7, then:

```bash
python main.py
```

---

## Check your score locally

Run the same grader the auto-grader runs on every PR push:

```bash
bash .hyf/test.sh
cat .hyf/score.json
```

---

## Scoring ladder

Tasks 2–6 are the core of this assignment and are enough to pass. Task 7 and the code quality checks are extra credit.

| Score | What the grader checks |
|---|---|
| 14 | Stubs committed: all five function names present, Azure imports, `data/` in `.gitignore` |
| ~24 | Task 2: `.info()`, `.describe()`, `.isna().sum()`, `.head()` all called |
| ~44 | Task 3: vectorised string cleaning, `pd.to_numeric`, `pd.to_datetime`, row filters, `drop_duplicates` on `transaction_id` |
| ~59 | Task 4: email normalisation, `how="inner"` merge, vectorised `is_high_value` (no loops) |
| ~79 | Task 5: named aggregations (`total_revenue=`, `order_count=`), `isocalendar().week`, `("customer_name", "first")` |
| ~89 | Task 6: all three output files written with `index=False`, chart saved with `savefig` |
| ~94 | *(extra credit)* Task 7: `upload_outputs` uses `assert` + `len()` to verify the Azure round-trip |
| 100 | *(extra credit)* Code quality: `Path(...)` constructor and `logging.info/warning/error` calls used in `src/` |
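
To make the pandas idioms named in the rows for Tasks 3 to 5 concrete, here is a rough sketch on toy data. It is not the reference solution. The column names `transaction_id`, `customer_name`, `is_high_value`, `total_revenue` and `order_count` come from this README; everything else (the `email`, `amount` and `order_date` columns, the function names, the 500 threshold, and the data itself) is assumed for illustration. Check the docstrings in `src/` and `.hyf/test.sh` for the real requirements.

```python
import pandas as pd

# Toy data: the columns email, amount and order_date are assumptions.
sales = pd.DataFrame({
    "transaction_id": ["T1", "T1", "T2", "T3", "T4"],
    "email": [" Ann@Example.com", " Ann@Example.com", "bob@example.com ",
              "ann@example.com", "nobody@example.com"],
    "amount": ["100.5", "100.5", "n/a", "900", "50"],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02",
                   "2024-01-09", "2024-01-10"],
})
customers = pd.DataFrame({
    "email": ["ANN@example.com", "bob@example.com"],
    "customer_name": ["Ann", "Bob"],
})


def clean_sales(df):
    """Task 3 style cleaning: vectorised strings, type coercion, filters."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    # Row filter: drop rows whose amount or date could not be parsed.
    out = out[out["amount"].notna() & out["order_date"].notna()]
    return out.drop_duplicates(subset="transaction_id")


def add_customers(sales_df, customers_df, threshold=500.0):
    """Task 4 style join; the threshold value is an assumption."""
    cust = customers_df.copy()
    cust["email"] = cust["email"].str.strip().str.lower()
    merged = sales_df.merge(cust, on="email", how="inner")
    # Vectorised comparison, no Python loop.
    merged["is_high_value"] = merged["amount"] > threshold
    return merged


def weekly_report(df):
    """Task 5 style named aggregation per ISO week."""
    out = df.copy()
    out["week"] = out["order_date"].dt.isocalendar().week
    return out.groupby("week", as_index=False).agg(
        total_revenue=("amount", "sum"),
        order_count=("transaction_id", "count"),
    )


def customer_report(df):
    """Task 5 style named aggregation per customer."""
    return df.groupby("email", as_index=False).agg(
        customer_name=("customer_name", "first"),
        total_revenue=("amount", "sum"),
        order_count=("transaction_id", "count"),
    )
```

With the toy data, `clean_sales(sales)` keeps three rows (the duplicate `T1` and the unparseable `T2` amount are dropped), and the inner merge then discards `T4` because its email has no matching customer.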

---

## Submitting

1. Create a branch `week4/your-name`.

2. Commit your work.

3. Push and open a Pull Request.
