{"id":50420023,"url":"https://github.com/2u39u4/multi-stage-recommender","last_synced_at":"2026-05-31T08:02:07.768Z","repository":{"id":353698481,"uuid":"1220555284","full_name":"2u39u4/multi-stage-recommender","owner":"2u39u4","description":"End-to-end multi-stage recommender on MovieLens: 5-channel recall fusion, DeepFM/DIN ranking, MMR re-ranking, and FastAPI + Streamlit serving with Docker, MLflow, FAISS, and CI.","archived":false,"fork":false,"pushed_at":"2026-05-31T07:11:39.000Z","size":3274,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T07:16:00.491Z","etag":null,"topics":["collaborative-filtering","deep-learning","docker","faiss","fastapi","hydra","information-retrieval","machine-learning","mlflow","mlops","movielens","pytorch","ranking","recommender-system","streamlit"],"latest_commit_sha":null,"homepage":"https://github.com/2u39u4/multi-stage-recommender","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/2u39u4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-25T03:12:05.000Z","updated_at":"2026-05-31T07:11:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/2u39u4/multi-stage-recommender","commit_stats":null,"previous_names":["2u39u4/multi-stage-recommender"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/2u39u4/multi-stage-recommender","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2u39u4%2Fmulti-stage-recommender","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2u39u4%2Fmulti-stage-recommender/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2u39u4%2Fmulti-stage-recommender/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2u39u4%2Fmulti-stage-recommender/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/2u39u4","download_url":"https://codeload.github.com/2u39u4/multi-stage-recommender/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2u39u4%2Fmulti-stage-recommender/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33723549,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collaborative-filtering","deep-learning","docker","faiss","fastapi","hydra","information-retrieval","machine-learning","mlflow","mlops","movielens","pytorch","ranking","recommender-system","streamlit"],"created_at":"2026-05-31T08:02:04.765Z","updated_at":"2026-05-31T08:02:07.756Z","avatar_url":"https://github.com/2u39u4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NeoRec — Production-Style Multi-Stage Recommender\n\n\u003e Portfolio-ready recommender system on MovieLens: multi-channel recall →\n\u003e DeepFM/DIN ranking → MMR/rules re-ranking → FastAPI + Streamlit serving.\n\n[![CI](https://github.com/2u39u4/multi-stage-recommender/actions/workflows/ci.yaml/badge.svg)](https://github.com/2u39u4/multi-stage-recommender/actions/workflows/ci.yaml)\n[![Python](https://img.shields.io/badge/python-3.10-blue.svg)]()\n[![PyTorch](https://img.shields.io/badge/PyTorch-2.2-EE4C2C.svg)]()\n[![FAISS](https://img.shields.io/badge/FAISS-1.8-006699.svg)]()\n[![MLflow](https://img.shields.io/badge/MLflow-2.x-0194E2.svg)]()\n[![Docker](https://img.shields.io/badge/Docker-ready-2496ED.svg)]()\n[![License](https://img.shields.io/badge/license-MIT-green.svg)]()\n\n**Repository**: https://github.com/2u39u4/multi-stage-recommender  \n**Status**: final portfolio release.\n\n---\n\n## 1. 30-Second Read\n\nNeoRec is an end-to-end recommender portfolio project that mirrors an industrial\nfunnel: retrieve ~1 000 candidates, rank them down to 20, diversify to top-10,\nand serve the result through a monitored API.\n\n- **Best recall result**: 5-channel fusion lifts Recall@10 from 0.0590\n  (best single channel) to **0.0827**, a **+40.2%** relative gain.\n- **Best ranker**: DIN beats LR / GBDT / DeepFM under an out-of-fold training\n  protocol designed to avoid look-ahead bias.\n- **Research evidence**: six controlled ablations, conversion-funnel analysis,\n  DIN attention visualization, and paired-bootstrap 95% CIs.\n- **Engineering evidence**: Hydra configs, MLflow runs, pytest + CI-configured\n  Ruff/mypy, FastAPI, Redis fallback, Streamlit dashboard, Docker Compose,\n  Prometheus hooks.\n\n| Layer | What ships | Headline |\n|---|---|---|\n| Recall | iALS + Two-Tower + SASRec + popularity + cold-start | fusion Recall@10 **0.0827** |\n| Ranking | LR / GBDT / DeepFM / DIN | DIN Recall@10 **0.0477**, AUC **0.931** |\n| Re-ranking | MMR + IPS debias + business rules | coverage +27% at λ=0.7 |\n| Serving | `/recommend`, `/metrics`, dashboard | local p50 **23.5 ms**; Docker p50 **~1.0 s** |\n| Reproducibility | cached JSON, MLflow, generated figures | README image references are committed |\n\nEvaluation uses MovieLens-1M leave-one-out with full-catalog scoring over all\n3 533 processed items and seen-item masking. Detailed model reports, run IDs,\nand reproduction commands live under [`experiments/results/`](experiments/results);\nablation caches live under [`experiments/ablations/`](experiments/ablations).\n\n---\n\n## 2. System Architecture\n\n```mermaid\nflowchart TD\n    A[User Behavior Logs\u003cbr/\u003eMovieLens 1M] --\u003e B[Feature Engineering\u003cbr/\u003e+ Feature Store]\n    B --\u003e C1[ALS / iALS\u003cbr/\u003eRecall: 300]\n    B --\u003e C2[Two-Tower DSSM\u003cbr/\u003eRecall: 500]\n    B --\u003e C3[SASRec\u003cbr/\u003eRecall: 300]\n    B --\u003e C4[Popularity + Cold-start\u003cbr/\u003eRecall: 200]\n    C1 --\u003e D[Candidate Merger\u003cbr/\u003e~1000 items]\n    C2 --\u003e D\n    C3 --\u003e D\n    C4 --\u003e D\n    D --\u003e E[DeepFM Pre-Ranking\u003cbr/\u003e1000 → 100]\n    E --\u003e F[DIN / Transformer\u003cbr/\u003eFine-Ranking\u003cbr/\u003e100 → 20]\n    F --\u003e G[Diversity + Rule\u003cbr/\u003eRe-Ranking\u003cbr/\u003e20 → 10]\n    G --\u003e H[Top-K Recommendation]\n\n    H --\u003e I[FastAPI Serving]\n    H --\u003e J[Streamlit Dashboard]\n    I --\u003e K[(Redis\u003cbr/\u003efeature cache)]\n    I --\u003e L[(FAISS HNSW\u003cbr/\u003evector index)]\n    I --\u003e M[Prometheus\u003cbr/\u003emetrics]\n```\n\n**Why multi-stage?** Real-world catalogs have $10^6$ – $10^9$ items. A single deep\nranker is computationally infeasible; the funnel architecture reduces candidate\nsize by ~5 orders of magnitude while preserving relevance, mirroring industrial\ndesigns documented by Google, Meta, ByteDance, and Pinterest.\n\n---\n\n## 3. Tech Stack\n\n| Layer | Tools |\n|---|---|\n| **Language / DL** | Python 3.10, PyTorch 2.2; optional TensorFlow/deepctr extras are isolated from the shipped path |\n| **Classic ML / CF** | `implicit` (iALS), `lightfm`, scikit-learn |\n| **Vector Search** | FAISS (HNSW, IVF-PQ) |\n| **Config \u0026 Tracking** | Hydra, MLflow, Weights \u0026 Biases (optional) |\n| **Serving** | FastAPI, Uvicorn, Redis, Streamlit; Prometheus/Grafana are optional observability services |\n| **Containerization** | Docker, docker-compose |\n| **Quality** | pytest, ruff, mypy, pre-commit, GitHub Actions |\n| **Reference impls** | [Microsoft Recommenders](https://github.com/microsoft/recommenders) (baseline cross-check) |\n\n---\n\n## 4. Project Structure\n\n```\nneorec/\n├── configs/                       # Hydra configs (composable, overridable)\n│   ├── config.yaml\n│   ├── data/movielens_1m.yaml\n│   ├── recall/{als,two_tower,sasrec,popularity}.yaml\n│   ├── rank/{deepfm,din,transformer}.yaml\n│   └── serving/default.yaml\n│\n├── data/                          # gitignored\n│   ├── raw/                       # MovieLens\n│   ├── processed/                 # parquet feature tables\n│   └── embeddings/                # user/item vectors\n│\n├── src/neorec/\n│   ├── data/\n│   │   ├── download.py\n│   │   ├── preprocess.py          # leave-one-out / time-based split\n│   │   ├── feature_store.py       # offline + online feature lookup\n│   │   └── feature_engineering.py\n│   │\n│   ├── recall/\n│   │   ├── base.py                # AbstractRecaller\n│   │   ├── als.py\n│   │   ├── two_tower.py           # DSSM / YouTubeDNN-style BPR retrieval\n│   │   ├── sasrec.py              # self-attentive sequential rec\n│   │   ├── popularity.py\n│   │   ├── cold_start.py          # content-based fallback\n│   │   └── merge.py               # weighted / RRF fusion\n│   │\n│   ├── ranking/\n│   │   ├── base.py\n│   │   ├── deepfm.py              # pre-ranking\n│   │   ├── din.py                 # fine-ranking\n│   │   └── transformer_ctr.py     # optional, BST-style\n│   │\n│   ├── rerank/\n│   │   ├── mmr.py                 # Maximal Marginal Relevance\n│   │   ├── debias.py              # long-tail / popularity debias\n│   │   └── rules.py               # business rules\n│   │\n│   ├── serving/\n│   │   ├── faiss_index.py         # HNSW build / load\n│   │   ├── feature_cache.py       # Redis client\n│   │   ├── pipeline.py            # online inference orchestrator\n│   │   ├── api.py                 # FastAPI app\n│   │   └── dashboard.py           # Streamlit\n│   │\n│   ├── eval/\n│   │   ├── metrics.py             # Recall@K, NDCG@K, MRR, Coverage, Novelty\n│   │   ├── significance.py        # paired t-test / bootstrap CI\n│   │   └── counterfactual.py      # IPS / SNIPS for offline A/B\n│   │\n│   ├── utils/\n│   │   ├── seed.py\n│   │   ├── logger.py\n│   │   └── timer.py\n│   │\n│   └── cli.py                     # `neorec train recall.als`, etc.\n│\n├── notebooks/\n│   ├── 01_eda.ipynb\n│   ├── 02_recall_analysis.ipynb\n│   ├── 03_ranking_din_attention.ipynb\n│   ├── 03_ablations.ipynb\n│   ├── 04_funnel_conversion.ipynb\n│   └── 05_statistical_tests.ipynb\n│\n├── experiments/\n│   ├── results/                   # MLflow-exported tables, plots\n│   └── ablations/\n│\n├── tests/                         # pytest unit + integration coverage\n│\n├── docker/\n│   ├── Dockerfile.train\n│   ├── Dockerfile.serve\n│   └── docker-compose.yaml        # core serving + optional observability\n│\n├── .github/workflows/ci.yaml\n├── Makefile                       # setup, training, benchmark, serving helpers\n├── pyproject.toml                 # uv / poetry-managed\n├── requirements.txt\n└── README.md\n```\n\n---\n\n## 5. Datasets\n\n| Dataset | Users | Items | Interactions | Used for |\n|---|---|---|---|---|\n| MovieLens-1M | 6 040 | 3 706 | 1 M | main experiments |\n| MovieLens-20M | 138 K | 27 K | 20 M | supported future scaling target |\n\n**Splits**: leave-one-out per user — a common protocol in the recsys literature —\nis reported throughout §7. A time-based 80 / 10 / 10 split is **implemented**\nin `data/preprocess.py` (set `data.split.strategy=time_based` to use it), but\nthe README headline tables currently report only the leave-one-out runs.\n\n**Negatives**: BPR with uniform random negatives for Two-Tower / SASRec\n(deliberately chosen over in-batch sampled softmax after observing embedding\ncollapse on this benchmark — see `experiments/results/recall_two_tower.md`).\n\n### 5.1 Exploratory Data Analysis — what we learned *before* modelling\n\n\u003e 7 figures, two non-trivial analyses (cold-start sub-population, long-tail\n\u003e Lorenz / Gini), and concrete design implications for every recall channel.\n\u003e Full notebook: [`notebooks/01_eda.ipynb`](notebooks/01_eda.ipynb). Rebuild the\n\u003e notebook with `python scripts/build_eda_notebook.py`, then execute it to\n\u003e regenerate `experiments/results/eda/*.png`.\n\n**(1) Rating distribution motivates `rating ≥ 4` binarisation.**\n57.5% of raw ratings are 4-or-5 — a strong positive signal, not noise.\nLowering the cutoff to ≥3 would keep 83.6% but inject lukewarm \"watched\"\nsignal that hurts implicit-feedback training.\n\n![Rating distribution](experiments/results/eda/01_rating_distribution.png)\n\n**(2) User activity ranges over 2+ orders of magnitude.**\nML-1M is pre-filtered by the dataset authors to ≥20 raw ratings/user, so the\n`min_interactions ≥ 5` safety filter only discards 6 users (0.1%). The real\nstory is the wide activity spread — p10 ≤ 17 positives, p90 ≥ 225 — which\nmotivates having *both* head-friendly (popularity) and tail-friendly\n(content-based) recall channels.\n\n![User activity](experiments/results/eda/02_user_activity.png?v=2)\n\n**(3) Item popularity is strongly Zipf (slope ≈ −1.57).**\nThe top-20% of items capture 72.9% of all positives — a textbook long-tail.\nThis justifies popularity as a strong baseline *and* explains why\ndebias / diversity re-ranking will matter for production-quality serving.\n\n![Item popularity Zipf](experiments/results/eda/03_item_popularity_zipf.png)\n\n**(4) Temporal structure — honest reading.**\nMedian user is active on **1 distinct day** (a single rating ceremony); only\n24.4% return on ≥3 distinct days. SASRec therefore captures mainly\n*within-session* item-to-item semantics (genre / style clustering inside the\nbatch), not multi-day preference drift — which explains why SASRec's\nRecall@10 closely matches Two-Tower's on this benchmark, and why its margin\nwould be expected to widen on streaming-style datasets (Last.fm, Yoochoose).\n\n![Temporal density](experiments/results/eda/04_temporal_density.png)\n\n**(5) Genres are multi-label, moderately skewed.**\nAverage 1.69 genres per movie; head genre (Drama) covers 40% of catalog,\ntail genre (Film-Noir) only 43 movies. This is the *right* shape for TF-IDF\ncontent features in the `cold_start` channel — head genres provide robustness,\ntail genres provide discriminative signal.\n\n![Genre frequency](experiments/results/eda/05_genre_frequency.png)\n\n**(6) Cold-start proxy: D1 (least active) vs D10 (most active).**\nML-1M has no truly cold users (pre-filter ≥20 ratings), so we use bottom-decile\nusers (≤17 positives) as a proxy. Their genre preferences match top-decile\nusers almost perfectly (cosine = 0.998) — head-genre tastes are universal.\n**Implication**: mean-popularity fallback in `cold_start.py` is essentially\nfree; TF-IDF earns its keep on **item-level** discrimination (recommending the\nright Drama, not Drama vs Western), not user-level.\n\n![Cold-start sub-population](experiments/results/eda/06_coldstart_subpopulation.png)\n\n**(7) Long-tail coverage — Lorenz / Gini.**\nGini coefficient is **0.70** — close to income-inequality levels. A\npopularity-only recommender serving top-200 items covers only ~6% of the\ncatalog. This is the formal motivation for multi-channel fusion: relying on\nany *single* signal is not enough for production-style catalog coverage.\n\n![Long-tail coverage](experiments/results/eda/07_longtail_coverage.png)\n\n\u003e **Summary — how the EDA shaped every W2 design choice.**\n\u003e\n\u003e | EDA finding | Design choice |\n\u003e |---|---|\n\u003e | 57.5% of ratings are ≥4 | binarisation threshold of 4.0 |\n\u003e | activity spans 14 → 484+ positives (p10–p90) | both popularity *and* content channels needed |\n\u003e | Zipf slope −1.57, top-20% → 73% of interactions | popularity baseline is strong; debias re-ranking on the roadmap |\n\u003e | median 1 active day; only 24.4% multi-day | SASRec captures within-session semantics — explains the modest gap vs Two-Tower on ML-1M |\n\u003e | 18 multi-label genres (avg 1.69 / movie) | TF-IDF over genres for the `cold_start` channel |\n\u003e | D1 vs D10 genre cosine ≈ 0.998 | mean-popularity fallback is safe; TF-IDF earns its keep on item discrimination |\n\u003e | Gini 0.70, popularity-only top-200 covers \u003c6% | multi-channel fusion (RRF) is *required* for catalog coverage |\n\n---\n\n## 6. Models Implemented\n\n### 6.1 Recall (multi-channel)\n\n| Model | Type | Reference |\n|---|---|---|\n| iALS | Matrix Factorization | Hu et al., ICDM 2008 |\n| DSSM Two-Tower | Deep retrieval | Huang et al., CIKM 2013 |\n| YouTubeDNN-style retrieval | Deep retrieval pattern | Covington et al., RecSys 2016 |\n| **SASRec** | Self-attentive sequential | Kang \u0026 McAuley, ICDM 2018 |\n| Popularity | Heuristic baseline | — |\n| Cold-start | Content-based (genre + meta) | — |\n\n### 6.2 Pre-Ranking \u0026 Fine-Ranking\n\n| Model | Stage | Reference |\n|---|---|---|\n| LR | Baseline | — |\n| GBDT (LightGBM) | Baseline | — |\n| **DeepFM** | Pre-rank | Guo et al., IJCAI 2017 |\n| **DIN** | Fine-rank | Zhou et al., KDD 2018 |\n| Transformer CTR (BST-style) | Optional | Chen et al., DLP-KDD 2019 |\n\n### 6.3 Re-Ranking\n\n- **MMR (Maximal Marginal Relevance)** — diversity\n- **Popularity debias** — inverse-propensity re-weighting\n- **Business rules** — already-watched filtering, category quota\n\n---\n\n## 7. Results\n\n\u003e Numbers are exported from the per-model MLflow runs and cached experiment\n\u003e artifacts. Per-section reproduction commands are linked from\n\u003e `experiments/results/`; plots and significance tests live there as well.\n\n### 7.1 Recall stage (MovieLens-1M, leave-one-out, K=200, full-rank)\n\n| Model | Recall@200 | NDCG@200 | MRR@200 | Coverage@200 |\n|---|---|---|---|---|\n| Popularity                            | 0.3543     | 0.0722   | 0.0190  | 0.213        |\n| Cold-start                            | 0.1848     | 0.0362   | 0.0089  | **0.997**    |\n| iALS                                  | 0.4997     | 0.1025   | 0.0274  | 0.824        |\n| Two-Tower                             | 0.4914     | 0.1027   | 0.0287  | 0.945        |\n| SASRec                                | 0.3305     | 0.0764   | 0.0262  | 0.891        |\n| **Multi-channel (RRF, 5ch)**          | **0.5631** | **0.1230** | 0.0370 | 0.987        |\n| **Multi-channel (norm_weighted, 5ch)** | **0.5747** | **0.1258** | 0.0380 | 0.867       |\n\n\u003e Per-model details (params, MLflow run id, repro commands): see\n\u003e [`experiments/results/recall_*.md`](experiments/results).\n\u003e Channel comparison plots: [`notebooks/02_recall_analysis.ipynb`](notebooks/02_recall_analysis.ipynb).\n\n\u003e **Fusion-gain attribution (drop-one ablation on RRF — full table in\n\u003e [`experiments/results/recall_merge.md`](experiments/results/recall_merge.md)).**\n\u003e Removing iALS / Two-Tower / SASRec costs Recall@10 −10.6% / −8.9% / −7.8%\n\u003e respectively; removing the heuristic channels (popularity, cold-start) costs\n\u003e only −0.8% / −1.8%. Each learned channel contributes a measurable, distinct\n\u003e marginal — the fused gain is not driven by any single dominant retriever.\n\n### 7.2 Ranking head-to-head — LR · GBDT · DeepFM · DIN\n\nEnd-to-end evaluation: each ranker re-ranks the merge channel's top-1 000\ncandidates per user and is scored against the held-out leave-one-out item\n(Recall / NDCG / MRR @ K). Training uses an **out-of-fold (OOF) split** —\nrecall channels are fit on each user's first 90 % of history, rankers on\nthe chronologically-later 10 % — mirroring a production wall-clock setup.\n\n| Model  | Stage     | Valid AUC | Recall@10  | NDCG@10  | Recall@100 | Latency / user |\n|--------|-----------|----------:|-----------:|---------:|-----------:|---------------:|\n| LR (hashed + side feats)    | baseline   | 0.824 | 0.0290 | 0.0153 | 0.2126 | **0.35 ms** |\n| GBDT (HistGradientBoosting) | baseline   | 0.845 | 0.0358 | 0.0164 | 0.2131 | 1.40 ms     |\n| DeepFM                      | pre-rank   | 0.889 | 0.0401 | 0.0188 | 0.2748 | 0.48 ms     |\n| **DIN (with attention)**    | fine-rank  | **0.931** | **0.0477** | **0.0214** | **0.3031** | 4.34 ms |\n\nWithin-stage ordering matches the literature: **DIN \u003e DeepFM \u003e GBDT \u003e LR**.\nPer-K detail, MLflow run IDs, a no-attention DIN ablation, and the full\nW3 retrospective (look-ahead bias investigation that motivated the OOF\ntraining pipeline) are in\n[`experiments/results/ranking_comparison.md`](experiments/results/ranking_comparison.md)\nand\n[`experiments/results/ranking_scheme_a_investigation.md`](experiments/results/ranking_scheme_a_investigation.md).\n\n\u003e **Note on absolute numbers — ranker @10 vs recall @10 on ML-1M.** Under\n\u003e the same OOF pipeline the recall layer's RRF fusion reaches\n\u003e Recall@10 = 0.061, slightly above the best ranker's 0.048 here. This is\n\u003e the expected behaviour of leave-one-out evaluation on a small\n\u003e (~3.5 K-item) dense catalog: collaborative-filtering recall already\n\u003e saturates the candidate-generation task, leaving little headroom for a\n\u003e re-ranker to push the unique held-out item from positions 11–1000 into\n\u003e the top 10. The ranker's value in this project is therefore (a)\n\u003e **within-pool discrimination** — Valid AUC ≈ 0.93 on the harder 1:4\n\u003e random-negative task; (b) **latency control** — re-rank 1 000 → 100 in\n\u003e 4 ms instead of full-rank scoring the catalog; (c) demonstrating the\n\u003e full multi-stage **infrastructure** (Hydra / MLflow / Docker / OOF\n\u003e training pipeline / serving API). On production datasets (10⁷+ items,\n\u003e real click logs, contextual features) the ranker's marginal lift over\n\u003e recall is much larger — that is the regime the §10 serving API is\n\u003e designed for.\n\nDIN's local-activation unit is evaluated in §8.4 with an attention-vs-sum\nablation. The notebook walk-through is\n[`notebooks/03_ranking_din_attention.ipynb`](notebooks/03_ranking_din_attention.ipynb).\n\n### 7.3 Online Serving \u0026 Latency\n\nW5 turns the offline funnel into a live FastAPI path:\n\n```text\nGET /recommend/{user_id}\n  → merge recall top-1000\n  → DeepFM pre-rank top-100\n  → DIN fine-rank top-20\n  → MMR + business rules top-K\n```\n\nEach response returns a `latency_ms` breakdown. The code-level in-process\nnumbers measured during W3/W4 remain the stable reference for per-user model\ncompute; container/network latency depends on the local runtime and can be\nmeasured with `make serving-benchmark` after trained artefacts are present.\n\n| Stage | Current implementation | Offline compute reference |\n|---|---|---:|\n| Recall | `MergeRecaller` loads trained ALS / Two-Tower / SASRec / popularity / cold-start artefacts; FAISS HNSW build/load utilities are in `serving/faiss_index.py` | merge recall top-1000 |\n| Pre-rank | DeepFM loads from `artifacts/rank_oof/deepfm`, keeps top-100 | ~0.48 ms / user |\n| Fine-rank | DIN loads from `artifacts/rank_oof/din`, keeps top-20 | ~4.34 ms / user |\n| Re-rank | MMR λ + watched-filter + genre/year caps | ~0.8 ms / user |\n| API overhead | FastAPI + Pydantic + Prometheus metrics | measured locally with `make serving-benchmark` |\n\nLocal serving benchmark (Mac, Python 3.11 venv, Uvicorn on `127.0.0.1:8001`,\n30 requests, concurrency=4, warm pipeline):\n\n```text\nrequests_ok=30 errors=0 elapsed_s=0.18\nqps=170.10\np50_ms=23.53\np95_ms=26.10\np99_ms=26.95\n```\n\nDocker serving benchmark (Docker Desktop, `api + redis + dashboard`, 30 requests,\nconcurrency=4, warm pipeline):\n\n```text\nrequests_ok=30 errors=0 elapsed_s=7.79\nqps=3.85\np50_ms=1002.27\np95_ms=1311.45\np99_ms=1488.62\n```\n\nStatic dashboard overview generated from cached metrics:\n\n![Dashboard overview](experiments/results/figures/dashboard_overview.png)\n\nLatest W6 focused verification:\n\n```text\nfaiss 1.13.2\nnumpy 2.4.4\npytest tests/test_api.py tests/test_serving.py tests/test_rerank.py tests/test_pipeline_e2e.py -q\n26 passed\npython scripts/check_release_ready.py\nRelease readiness: PASS\nGET /health -\u003e 200, pipeline_ready=true\nGET /metrics -\u003e 200\nGET /recommend/1 -\u003e 200\nStreamlit /_stcore/health -\u003e 200\nDocker image build -\u003e PASS\nDocker core stack (api + redis + dashboard) -\u003e PASS\n```\n\nObservability services (`mlflow`, `prometheus`, `grafana`) are defined behind\nthe Docker Compose `observability` profile. They are useful for local inspection\nbut are not required for the release serving contract above.\n\nFinal release checks additionally cover README figure generation:\n\n```bash\npython scripts/build_readme_figures.py\n# writes experiments/results/figures/*.png from cached ablation JSON\n```\n\nServing-specific commands:\n\n```bash\nmake build-faiss          # optional: artifacts/serving/faiss_hnsw.index\nmake serve                # FastAPI on :8000\nmake dashboard            # Streamlit dashboard on :8501\nmake serving-benchmark    # p50 / p95 / p99 / QPS for local API\n```\n\n### 7.4 Re-ranking — MMR + IPS + business rules\n\nEnd-to-end runs the recall → DIN → re-rank stack on the OOF test set; the\nre-rank stack is **`mmr_rerank` → `ips_rerank` (optional) → `apply_rules`**.\n\n| Setting | Recall@10 | Coverage@10 | ILS@10 (↓ better) | Latency / user |\n|---|--:|--:|--:|--:|\n| DIN only (no rerank) — §7.2 row | 0.0477 | ~0.30 | — | 4.3 ms |\n| + MMR λ=1.0 (pure relevance + rules) | 0.0520 | 0.365 | 0.368 | +0.8 ms |\n| **+ MMR λ=0.7 (deployment default)** | **0.0466** | **0.383** | **0.333** | **+0.8 ms** |\n| + MMR λ=0.5 | 0.0408 | 0.403 | 0.293 | +0.8 ms |\n| + MMR λ=0.0 (pure diversity) | 0.0277 | 0.512 | 0.168 | +0.8 ms |\n\nλ is a deployment knob, not a model knob — the ranker doesn't have to\nre-train when product wants more or less diversity. Per-step latency is\nbenchmarked on a single CPU container.\n\n\u003e **Implementation**: `src/neorec/rerank/{mmr.py, debias.py, rules.py, pipeline.py}`,\n\u003e driven by `configs/rerank/mmr.yaml`. CLI: `neorec rerank rank=din rerank=mmr 'rerank.mmr.lambda=0.7'`.\n\u003e Full ablation: §8.1.\n\n---\n\n## 8. Ablation Studies\n\nSix controlled experiments quantify what every architectural choice is worth.\nRun any of them with `python scripts/run_ablations.py \u003cname\u003e`; results land\nunder `experiments/ablations/*.json` and figures under\n`experiments/results/figures/`. The committed README figures are regenerated\nwith `python scripts/build_readme_figures.py`. Notebook walk-through:\n[`notebooks/03_ablations.ipynb`](notebooks/03_ablations.ipynb).\n\n### 8.1 MMR λ Pareto frontier\n\n![MMR Pareto](experiments/results/figures/mmr_pareto_scatter.png)\n\nSweep λ ∈ {0, 0.3, 0.5, 0.7, 1.0}. Each step trades roughly **2× more\ndiversity** for **1× less accuracy**; we ship λ=0.7 as the deployment\ndefault (the knee). Coverage climbs from 0.36 → 0.51 across the sweep;\nILS drops from 0.37 → 0.17.\n\n### 8.2 Cold-start vs hot-user performance\n\n![Cold-start bucket](experiments/results/figures/cold_start_bucket.png)\n\nCounter-intuitive but real: cold users (\u003c20 training interactions) *out-score*\nhot users (60+) on Recall@10 (0.077 vs 0.042). Under LOO, hot users have\nmany high-relevance items already in their training history crowding the\ncandidate pool — the single test positive faces stiffer competition.\nCoverage shows the inverse pattern (hot 0.30 vs cold 0.19).\n\n### 8.3 Recall fusion strategy\n\n![Fusion strategy](experiments/results/figures/fusion_strategy_bar.png)\n\n`norm_weighted` (0.0827) edges out RRF (0.0794) and beats the best single\nchannel (Two-Tower, 0.0590) by **+40%**. Each base channel covers\ndifferent *kinds* of user-item affinity; the union is broader than the\nparts.\n\n### 8.4 DIN attention vs sum pooling\n\n![DIN attention ablation](experiments/results/figures/din_attention_ablation.png)\n\n| Variant | Recall@10 | Valid AUC |\n|---|--:|--:|\n| **with attention** | **0.0459** | 0.916 |\n| sum-pool only | 0.0424 | 0.909 |\n\nAttention is +8% Recall@10 / +0.7 pp AUC in this OOF run. Payoff is modest\non ML-1M; in the DIN paper, the main reported gains are AUC/RelaImpr lifts on\nMovieLens, Amazon Electronics, and Alibaba display-ad data rather than a\ndirect Recall@10 lift on this exact protocol.\n\n### 8.5 SASRec sequence length — the surprising finding\n\n![SASRec seq length](experiments/results/figures/sasrec_seq_len.png)\n\nRecall@10 *monotonically drops* as the sequence grows: **L=10 → 0.101,\nL=100 → 0.028**. Cause: SASRec's per-position BPR loss spends capacity\non positions whose targets have nothing to do with the LOO test item.\nWith L=10 the model is essentially a next-item predictor on the most\nrecent 10 items, which is exactly the LOO task; longer L dilutes the\npredictive signal. **Long sequences only help when the evaluation\nhorizon also grows** (session-based, multi-step). A clean train/eval\ntask mismatch — exactly the kind of finding that becomes a strong\ntalking point in interviews.\n\n### 8.6 Two-Tower capacity (embedding_dim)\n\n\u003e Plan called for a `num_negatives` sweep, but our Two-Tower trainer uses\n\u003e canonical single-negative BPR (Rendle 2009) — exactly one triplet per\n\u003e positive regardless of `num_negatives`. Substituted `embedding_dim`\n\u003e as a capacity probe because it is the real model-capacity knob exposed by\n\u003e this implementation.\n\n![Two-Tower capacity](experiments/results/figures/two_tower_neg.png)\n\nCapacity helps up to a point, then plateaus or regresses — ML-1M has\n~21 M user-item cells but only ~575 K observed positive interactions, so\nlarger embeddings quickly become weakly constrained. The default dim=64 is\nthe best measured setting in this sweep, with dim=128 trading a small\nRecall@10 drop for higher coverage.\n\n### 8.7 Conversion funnel + paired bootstrap\n\n![Conversion funnel](experiments/results/figures/funnel_bars.png)\n\n| Stage | Size | Positives | Retention |\n|---|--:|--:|--:|\n| merge top-1 000 (recall) | 1 000 | 5 157 | 100.0% |\n| DeepFM top-100 (pre-rank) | 100 | 1 658 | 32.2% |\n| DIN top-20 (fine-rank) | 20 | 517 | 10.0% |\n| MMR top-10 (rerank) | 10 | 288 | 5.6% |\n\nThe recall stage is the dominant ceiling — 14.5% of LOO positives never\neven enter the merge top-1 000. Improvements there cascade through every\ndownstream metric.\n\n![Paired bootstrap CI](experiments/results/figures/significance_ci.png)\n\nEvery headline Recall@10 gets a **paired bootstrap 95% CI** (1 000 resamples, paired by user):\n\n| Model | Recall@10 | 95% bootstrap CI |\n|---|--:|--:|\n| **DIN**    | **0.0477** | [0.0428, 0.0530] |\n| DeepFM | 0.0401 | [0.0353, 0.0449] |\n| GBDT   | 0.0358 | [0.0313, 0.0404] |\n| LR     | 0.0290 | [0.0249, 0.0333] |\n\nPairwise paired-bootstrap p-values: **DIN beats every other ranker**\n(p ≤ 0.012); **DeepFM vs GBDT is *not* significant** (p = 0.167) — a\ndirect example of why CIs matter on point-estimate tables. The full\nmatrix is in\n[`notebooks/05_statistical_tests.ipynb`](notebooks/05_statistical_tests.ipynb)\nand figure\n[`significance_matrix.png`](experiments/results/figures/significance_matrix.png).\n\n---\n\n## 9. Quick Start\n\n### 9.1 Local (uv / pip)\n\n```bash\ngit clone https://github.com/2u39u4/multi-stage-recommender.git\ncd multi-stage-recommender\nuv venv \u0026\u0026 source .venv/bin/activate     # or: python -m venv .venv\nuv pip install -e \".[dev]\"               # core + dev tooling\n\n# Optional full research/demo extras:\n# uv pip install -e \".[full,dev]\"\n\n# 1. download + preprocess (~2 min for 1M)\nneorec data download dataset=movielens_1m\nneorec data preprocess\n\n# 2. train all recall channels\nneorec train recall=als\nneorec train recall=two_tower\nneorec train recall=sasrec\n\n# 3. train rankers\nneorec train rank=deepfm\nneorec train rank=din\n\n# 4. evaluate end-to-end\nneorec eval pipeline=full\n\n# 5. launch serving\nmake build-faiss                          # optional HNSW index for vector serving\nmake serve                                # FastAPI on :8000\nmake dashboard                            # Streamlit on :8501\nmake serving-benchmark                    # p50 / p95 / p99 / QPS\n```\n\n### 9.2 Docker (recommended for reproducibility)\n\nThe core Docker path expects the same local assets as the Python serving path:\nprocessed parquet files under `data/processed/` and trained model artefacts\nunder `artifacts/`. On a fresh clone, run the local data/model steps in §9.1\nor restore those directories before expecting `/recommend` to return live\nrecommendations. Without artefacts, `/health` still works and reports the\nmissing path, but `/recommend` intentionally returns a diagnostic 503.\n\nCore serving stack, verified for this release:\n\n```bash\ndocker compose -f docker/docker-compose.yaml up --build api redis dashboard\n# → API:        http://localhost:8000/docs\n# → Dashboard:  http://localhost:8501\n```\n\nOptional observability stack:\n\n```bash\ndocker compose -f docker/docker-compose.yaml --profile observability up -d\n# → MLflow UI:  http://localhost:5000\n# → Prometheus: http://localhost:9090\n# → Grafana:    http://localhost:3000\n```\n\n### 9.3 Reproduce all paper-style numbers\n\n```bash\nmake all      # downloads data and runs the core training + benchmark targets\n```\n\n### 9.4 Release readiness checks\n\n```bash\nmake test-fast\nmake release-check\npython scripts/build_readme_figures.py\ndocker compose -f docker/docker-compose.yaml build\n```\n\n---\n\n## 10. Online Serving API\n\n```http\nGET /recommend/{user_id}?k=10\u0026diversity=0.7\n```\n\n```json\n{\n  \"user_id\": 123,\n  \"items\": [\n    {\n      \"item_id\": 2571,\n      \"title\": \"Matrix, The (1999)\",\n      \"score\": 0.93,\n      \"channel\": \"din\",\n      \"explain\": \"recall=merge_rrf; pre_rank=deepfm; fine_rank=din; MMR lambda=0.70\"\n    }\n  ],\n  \"latency_ms\": {\n    \"recall\": 8.1,\n    \"pre_rank\": 4.2,\n    \"fine_rank\": 11.5,\n    \"rerank\": 0.9,\n    \"total\": 24.7\n  }\n}\n```\n\nFastAPI hydrates `OnlinePipeline.from_config()` at startup. If local training\nartefacts are missing, `/health` still works and `/recommend` returns a\ndiagnostic 503 instead of crashing the server; once artefacts exist, the live\npath uses:\n\n- `MergeRecaller` for multi-channel recall;\n- `DeepFMRanker` for 1 000 → 100 pre-ranking;\n- `DINRanker` for 100 → 20 fine-ranking;\n- `mmr_rerank` + `apply_rules` for final top-K;\n- `RedisFeatureCache` when Redis is reachable, with an in-process fallback;\n- Prometheus `/metrics` for request counts and per-stage latency histograms.\n\nDashboard: `streamlit run src/neorec/serving/dashboard.py` or Docker service\n`dashboard`. Tabs cover live recommendation, λ comparison, offline metrics,\nand DIN attention heatmap.\n\n---\n\n## 11. Engineering Practices\n\n- **Configs**: every experiment is a Hydra YAML — no magic numbers in code.\n- **Tracking**: MLflow logs params, metrics, model artefacts, and run metadata.\n- **Determinism**: `set_seed(42)` covers Python / NumPy / PyTorch / TF / CUDA.\n- **Tests**: `pytest tests/` runs unit + integration tests with coverage output; `make test-fast` is the CI-safe subset.\n- **Style**: `ruff` lint and `mypy` are wired through local commands and CI.\n- **CI**: GitHub Actions runs lint, tests, and Docker image builds on pushes / PRs.\n- **Release check**: `make release-check` verifies core imports (`faiss`, `torch`, `fastapi`, Streamlit/plotting stack, etc.) before release.\n\n---\n\n## 12. Final Scope\n\nThis repository is the final portfolio version of NeoRec. The project stops at\nthe reproducible code, offline experiments, generated figures, tests, Docker\nserving stack, and release checklist.\n\nPossible future research directions, outside this finished version:\n\n- Multi-objective ranking (CTR + dwell-time + diversity).\n- Online learning with Kafka + River.\n- LLM-based explanation layer over item metadata.\n- Graph recall with LightGCN or PinSage.\n- Causal debias with doubly robust estimators.\n\n---\n\n## 13. References\n\n1. Hu, Koren, Volinsky. *Collaborative Filtering for Implicit Feedback Datasets.* ICDM 2008.\n2. Covington, Adams, Sargin. *Deep Neural Networks for YouTube Recommendations.* RecSys 2016.\n3. Kang, McAuley. *Self-Attentive Sequential Recommendation.* ICDM 2018.\n4. Guo et al. *DeepFM: A Factorization-Machine based Neural Network for CTR Prediction.* IJCAI 2017.\n5. Zhou et al. *Deep Interest Network for Click-Through Rate Prediction.* KDD 2018.\n6. Chen et al. *Behavior Sequence Transformer for E-commerce Recommendation.* DLP-KDD 2019.\n7. Microsoft Recommenders. https://github.com/microsoft/recommenders\n\n---\n\n## 14. Author\n\n**Junye Zhao** — applying for MS in AI / ML, Fall 2027  \nGitHub: [2u39u4](https://github.com/2u39u4)\n\n\u003e *Built end-to-end as a portfolio project to demonstrate proficiency across\n\u003e the full recommender-system stack — from research-style modelling to\n\u003e production-style serving.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u39u4%2Fmulti-stage-recommender","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F2u39u4%2Fmulti-stage-recommender","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u39u4%2Fmulti-stage-recommender/lists"}