https://github.com/hackyourfuture/data-assignment-week-2
HackYourFuture data track week 2 assignment files
- Host: GitHub
- URL: https://github.com/hackyourfuture/data-assignment-week-2
- Owner: HackYourFuture
- Created: 2026-04-28T13:18:14.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-13T19:24:56.000Z (about 1 month ago)
- Last Synced: 2026-05-13T21:26:13.902Z (about 1 month ago)
- Language: Shell
- Size: 3.54 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Track: Week 2 Assignment (Template)
The HackYourFuture Data Track Week 2 assignment: **Refactoring to a Clean Pipeline**.
> 👩‍🎓 **Students:** you are in the wrong place. Do **not** fork or use this template.
> Go to your cohort's assignment repo under
> [`HackYourAssignment`](https://github.com/HackYourAssignment) (e.g. `c55-data-week2`,
> `c56-data-week2`, …). Your teacher posts the exact link in your cohort channel.
> Fork the cohort repo, branch, and open a PR back to it. Full instructions live in the
> [Week 2 Assignment on Notion](https://www.notion.so/hackyourfuture/Week-2-Assignment-Refactoring-to-a-Clean-Pipeline-f8c27aa88d144cb18f54c49d02f50b73).
## For instructors / track maintainers
This repo is the **upstream template** for the Week 2 assignment. At the start of each
cohort, generate a cohort-specific repo under the `HackYourAssignment` org from this
template (GitHub: **Use this template → Create a new repository**, owner =
`HackYourAssignment`, name = `cNN-data-week2`, where `NN` is the cohort number). Students then fork *that* cohort repo
and open PRs back to it; the auto-grader runs on every push.
Edits to the assignment, dataset, or grader belong here on the template, not on the
cohort copies.
## Tasks at a glance
| Task | Folder | Points | What you build |
|---|---|---|---|
| **Task 1**: Cleaner Pipeline | `task-1/` | 60 | A modular Python pipeline with `config.py` (env-var loading), `models.py` (`Transaction` dataclass with `__post_init__` validation), `transforms.py` (4+ pure composable functions, no mutation), `pipeline.py` (orchestrator), and `tests/test_transforms.py` (4+ pytest tests). Reads `data/messy_sales.csv`, writes `output/clean_sales.csv`. |
| **Task 2**: AI Debug Report | `task-2/` | 20 | Document one debugging session where you used an LLM to fix a bug. Fill in the four sections of `AI_DEBUG.md`. |
| **Task 3**: Azure Blob Upload | `task-3/` | 20 | Upload `task-1/output/clean_sales.csv` to a private Blob container in the HYF Azure storage account using the portal's Storage Browser. Save your screenshot as `task-3/assets/azure_blob_week2.png` (`.jpg`/`.jpeg` also accepted) and the blob URL in `task-3/assets/blob_url.txt`. Working in Codespaces? See [AZURE_LOGIN.md](AZURE_LOGIN.md) to authenticate first. |
Total: 100 · Passing: 60.
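As a rough illustration of what Task 1 means by a `Transaction` dataclass with `__post_init__` validation, here is a minimal sketch. The field names (`email`, `product`, `quantity`, `unit_price`) and the specific rules are assumptions made for the example; the real columns come from `data/messy_sales.csv` and the TODOs in `models.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields: the actual schema is defined by the assignment data.
    email: str
    product: str
    quantity: int
    unit_price: float

    def __post_init__(self) -> None:
        # __post_init__ runs right after the generated __init__,
        # so an invalid record can never be constructed.
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")
        if self.quantity <= 0:
            raise ValueError(f"quantity must be positive, got {self.quantity}")
        if self.unit_price < 0:
            raise ValueError(f"unit_price must be non-negative, got {self.unit_price}")
```

With this shape, a pipeline could wrap construction in `try`/`except ValueError` and drop the rows that fail, which is one plausible way to filter out invalid and zero-quantity records.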
## Repository layout
```text
.
├── task-1/
│   ├── data/
│   │   └── messy_sales.csv         # the dataset (committed; do not edit)
│   ├── src/
│   │   ├── config.py               # env-var loader: fill in TODOs
│   │   ├── models.py               # Transaction dataclass: fill in TODOs
│   │   ├── transforms.py           # 4 pure transform functions: fill in TODOs
│   │   └── pipeline.py             # orchestrator: fill in TODOs
│   ├── tests/
│   │   └── test_transforms.py      # 4 pytest tests: fill in TODOs
│   ├── output/                     # your pipeline writes clean_sales.csv here (gitignored)
│   ├── .env.example                # copy to .env (gitignored) before running
│   └── requirements.txt            # python3 -m pip install -r requirements.txt
├── task-2/
│   └── AI_DEBUG.md                 # fill in the four sections
├── task-3/
│   └── assets/
│       ├── azure_blob_week2.png    # add your screenshot here (jpg/jpeg also accepted)
│       └── blob_url.txt            # paste your Azure Storage blob URL here
├── .hyf/
│   └── test.sh                     # auto-grader (read it to see exactly what it checks)
└── .github/workflows/
    └── grade-assignment.yml        # runs .hyf/test.sh on every PR
```
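To make the `config.py` role concrete, the sketch below shows one way an env-var loader might look. `INPUT_PATH` and `OUTPUT_PATH` are the variables the grader injects; the `Config` class, the `load_config` function, and the fallback defaults are assumptions for illustration, and reading the `.env` file itself is deliberately left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    input_path: str
    output_path: str


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Build a Config from environment variables, with fallback defaults."""
    # Accepting a mapping keeps the loader testable without touching os.environ.
    env = os.environ if env is None else env
    return Config(
        # Defaults are assumed here; they mirror the paths named in the task table.
        input_path=env.get("INPUT_PATH", "data/messy_sales.csv"),
        output_path=env.get("OUTPUT_PATH", "output/clean_sales.csv"),
    )
```

Passing the environment in as a parameter is a design choice, not a requirement: it lets a test call `load_config({"INPUT_PATH": "x.csv"})` directly.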
## Run the grader locally
Before opening a PR, run the same checks the auto-grader runs:
```bash
cd task-1
python3 -m pip install -r requirements.txt
cp .env.example .env
cd ..
bash .hyf/test.sh
cat .hyf/score.json
```
The grader prints a per-task breakdown so you can see exactly which check failed and
why. The PR-time grader does the same: your local run and the CI run are identical.
## Scoring ladder (Task 1)
The grader awards points incrementally so partial credit is meaningful:
- **10/60**: required files exist (`config.py`, `models.py`, `transforms.py`, `pipeline.py`, `tests/test_transforms.py`, `.env.example`).
- **20/60**: `python -m src.pipeline` runs from `task-1/` without crashing (the grader injects `INPUT_PATH` and `OUTPUT_PATH` inline; your local `.env` is not used during grading).
- **40/60**: `output/clean_sales.csv` passes structural checks: 12 rows (15 input rows minus 3 invalid/zero-quantity rows), lowercased emails, title-cased product names, "Unknown" filled in for missing categories, `revenue` and `vat` columns present and correctly calculated.
- **60/60**: code looks engineered: `models.py` defines a `@dataclass` with `__post_init__`; `transforms.py` uses the `{**row, ...}` spread pattern (no mutation); `pytest tests/` reports all tests passing.
The 40-point cap exists to stop a 5-line script that hardcodes the expected output from getting full marks. Real engineering patterns (dataclass + spread + tests) are required for the top 20 points.
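The `{**row, ...}` spread pattern and the no-mutation requirement can be sketched as follows. Column names (`email`, `product`, `category`, `quantity`, `unit_price`), the function names, and the `vat_rate` parameter are all assumptions for the example; the actual VAT rule and schema are whatever the assignment and `.hyf/test.sh` specify.

```python
def normalize_email(row: dict) -> dict:
    """Return a new row with the email lowercased; the input is untouched."""
    return {**row, "email": row["email"].lower()}


def title_case_product(row: dict) -> dict:
    """Return a new row with the product name title-cased."""
    return {**row, "product": row["product"].title()}


def fill_category(row: dict) -> dict:
    """Return a new row with a missing or empty category set to "Unknown"."""
    return {**row, "category": row.get("category") or "Unknown"}


def add_revenue_and_vat(row: dict, vat_rate: float) -> dict:
    """Return a new row with revenue and vat columns (formula assumed)."""
    revenue = row["quantity"] * row["unit_price"]
    return {**row, "revenue": revenue, "vat": round(revenue * vat_rate, 2)}


def test_normalize_email_does_not_mutate_input():
    # A pytest-style test: checks the result and that the original dict is unchanged.
    original = {"email": "Ada@Example.COM"}
    cleaned = normalize_email(original)
    assert cleaned["email"] == "ada@example.com"
    assert original["email"] == "Ada@Example.COM"
```

Because every function returns a fresh dict, the transforms compose in any order without one step silently changing the input of another.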