{"id":37082891,"url":"https://github.com/finite-sample/stagecoachml","last_synced_at":"2026-01-14T10:01:20.936Z","repository":{"id":326046736,"uuid":"1103057699","full_name":"finite-sample/stagecoachml","owner":"finite-sample","description":"Build two-stage models when your features arrive in two batches at different times.","archived":false,"fork":false,"pushed_at":"2025-12-14T21:28:00.000Z","size":429,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-16T08:08:31.416Z","etag":null,"topics":["machine-learning","scikit-learn","two-stage-models"],"latest_commit_sha":null,"homepage":"https://finite-sample.github.io/stagecoachml/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/finite-sample.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-24T11:25:29.000Z","updated_at":"2025-12-14T21:28:04.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/finite-sample/stagecoachml","commit_stats":null,"previous_names":["finite-sample/stagecoachml"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/finite-sample/stagecoachml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstagecoachml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstagecoachml/tags","releases_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/finite-sample%2Fstagecoachml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstagecoachml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/finite-sample","download_url":"https://codeload.github.com/finite-sample/stagecoachml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstagecoachml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28416490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","scikit-learn","two-stage-models"],"created_at":"2026-01-14T10:01:19.844Z","updated_at":"2026-01-14T10:01:20.930Z","avatar_url":"https://github.com/finite-sample.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StagecoachML\n\n[![PyPI - 
Version](https://img.shields.io/pypi/v/stagecoachml.svg)](https://pypi.org/project/stagecoachml)\n[![Tests](https://github.com/finite-sample/stagecoachml/actions/workflows/ci.yml/badge.svg)](https://github.com/finite-sample/stagecoachml/actions/workflows/ci.yml)\n[![Documentation](https://github.com/finite-sample/stagecoachml/actions/workflows/docs.yml/badge.svg)](https://finite-sample.github.io/stagecoachml/)\n[![Try in Browser](https://img.shields.io/badge/Try%20in%20Browser-JupyterLite-orange)](https://finite-sample.github.io/stagecoachml/lite/lab/index.html?path=quickstart_interactive.ipynb)\n[![PyPI Downloads](https://static.pepy.tech/badge/stagecoachml)](https://pepy.tech/projects/stagecoachml)\n\n\n**StagecoachML** is a tiny library for building two-stage models when your features arrive in two batches at different times.\n\nThink:\n\n- Ad serving and recommendation:  \n  first score on **user + context**, then refine on **creative/item + real-time signals**.\n- Per-customer privacy:  \n  a shared **non-sensitive trunk**, plus a **per-customer head** that uses private fields inside their own environment.\n- Latency-sensitive inference:  \n  run a fast **stage-1** model early in the request, and only run the heavier **stage-2** model when needed.\n\nStagecoachML encodes that pattern directly in the model interface instead of leaving it buried in infra and notebooks.\n\n---\n\n## When should you use StagecoachML?\n\nUse StagecoachML when:\n\n- You **can’t wait** for all features before you have to start making decisions.\n- Some features live in a different **silo** (e.g. 
customer’s infra) and must never\n  hit the central model.\n- You want to **tune and evaluate** the whole two-stage system as a *single* estimator\n  (train/test/CV), while still being able to:\n  - get stage-1 scores from early features, and  \n  - get refined scores once late features arrive.\n\nIf you have all your features at once and a single model is fine, this library is\nprobably overkill. But if you live with staggered features, StagecoachML keeps the\nlogic honest.\n\n---\n\n## Core idea\n\nA StagecoachML model splits features into two groups:\n\n- **Early features**: available at stage 1 (e.g. user, context).\n- **Late features**: only available at stage 2 (e.g. ad/creative/item, customer-side data).\n\nYou choose:\n\n- a **stage-1 estimator** that sees only early features, and\n- a **stage-2 estimator** that sees late features plus (optionally) the stage-1\n  prediction, and either:\n  - learns to predict the **residual** `y − ŷ₁`, or\n  - learns the final target directly.\n\nAt inference time you can:\n\n- call `predict_stage1(...)` / `predict_stage1_proba(...)` when you only have\n  early features; and\n- call `predict(...)` / `predict_proba(...)` later when you have both.\n\nUnder the hood, you still train and cross-validate it like any other sklearn estimator.\n\n---\n\n## Try it Online\n\n[![Try in Browser](https://img.shields.io/badge/Try%20in%20Browser-JupyterLite-orange)](https://finite-sample.github.io/stagecoachml/lite/lab/index.html?path=quickstart_interactive.ipynb)\n\nClick the badge above to try StagecoachML directly in your browser with interactive examples powered by Pyodide - runs instantly with zero installation!\n\n## Installation\n\nStagecoachML is a pure Python package that depends on NumPy, pandas, and scikit-learn.\n\n```bash\npip install stagecoachml\n```\n\nOr install from source:\n\n```bash\ngit clone https://github.com/finite-sample/stagecoachml.git\ncd stagecoachml\npip install -e .\n```\n\nImport the 
estimators:\n\n```python\nfrom stagecoachml import StagecoachRegressor, StagecoachClassifier\n```\n\n---\n\n## Quick start\n\n### Regression example (diabetes dataset)\n\n```python\nfrom stagecoachml import StagecoachRegressor\nfrom sklearn.datasets import load_diabetes\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn.metrics import r2_score\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.ensemble import RandomForestRegressor\n\n# Load data as a DataFrame\ndiabetes = load_diabetes(as_frame=True)\nX = diabetes.frame.drop(columns=[\"target\"])\ny = diabetes.frame[\"target\"]\n\n# Split columns into \"early\" and \"late\" features\nfeatures = list(X.columns)\nmid = len(features) // 2\nearly_features = features[:mid]   # pretend these arrive early\nlate_features  = features[mid:]   # pretend these arrive later\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=0\n)\n\n# Stage-1: fast global model on early features\nstage1 = LinearRegression()\n\n# Stage-2: more flexible model on late features + stage-1 prediction\nstage2 = RandomForestRegressor(n_estimators=200, random_state=0)\n\nmodel = StagecoachRegressor(\n    stage1_estimator=stage1,\n    stage2_estimator=stage2,\n    early_features=early_features,\n    late_features=late_features,\n    residual=True,\n    use_stage1_pred_as_feature=True,\n    inner_cv=None,            # set \u003e1 to cross-fit stage-1 preds if you care\n)\n\n# Hyper-parameter search over both stages\nparam_grid = {\n    \"stage1_estimator__fit_intercept\": [True, False],\n    \"stage2_estimator__max_depth\": [None, 5, 10],\n}\ngrid = GridSearchCV(model, param_grid, cv=5)\ngrid.fit(X_train, y_train)\n\nbest = grid.best_estimator_\n\nprint(\"Stage-1 test R²: \", r2_score(y_test, best.predict_stage1(X_test)))\nprint(\"Final   test R²: \", r2_score(y_test, best.predict(X_test)))\n```\n\n### Classification example (breast cancer dataset)\n\n```python\nfrom 
stagecoachml import StagecoachClassifier\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import accuracy_score, f1_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier\n\ndata = load_breast_cancer(as_frame=True)\nX = data.data\ny = data.target\n\nfeatures = list(X.columns)\nmid = len(features) // 2\nearly = features[:mid]\nlate  = features[mid:]\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=1, stratify=y\n)\n\nstage1_clf = LogisticRegression(max_iter=1000)\nstage2_clf = RandomForestClassifier(n_estimators=200, random_state=2)\n\nmodel = StagecoachClassifier(\n    stage1_estimator=stage1_clf,\n    stage2_estimator=stage2_clf,\n    early_features=early,\n    late_features=late,\n    use_stage1_pred_as_feature=True,\n)\n\nmodel.fit(X_train, y_train)\n\ndef metrics(y_true, y_pred):\n    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)\n\n# Provisional scores from early features only\nstage1_test_proba = model.predict_stage1_proba(X_test)\nstage1_acc, stage1_f1 = metrics(y_test, (stage1_test_proba \u003e= 0.5).astype(int))\n\n# Final scores with all features\nfinal_acc, final_f1 = metrics(y_test, model.predict(X_test))\n\nprint(\"Stage-1  test accuracy/F1:\", f\"{stage1_acc:.3f}/{stage1_f1:.3f}\")\nprint(\"Final    test accuracy/F1:\", f\"{final_acc:.3f}/{final_f1:.3f}\")\n```\n\n---\n\n## API overview\n\n### `StagecoachRegressor`\n\n```python\nStagecoachRegressor(\n    stage1_estimator,\n    stage2_estimator,\n    early_features,\n    late_features,\n    residual=True,\n    use_stage1_pred_as_feature=True,\n    inner_cv=None,\n    random_state=None,\n)\n```\n\nKey points:\n\n* `stage1_estimator`: any sklearn regressor (`RandomForestRegressor`, `LinearRegression`, etc.).\n* `stage2_estimator`: another regressor for the late features (often more flexible).\n* 
`early_features` / `late_features`: column names defining feature arrival.\n* `residual=True`: stage 2 learns `y − ŷ₁` and we add it back at prediction time.\n* `use_stage1_pred_as_feature=True`: stage-1 prediction becomes an extra input to stage 2.\n* `inner_cv`: optional K-fold cross-fitting to generate out-of-fold stage-1 predictions for stage-2 training.\n\nMethods:\n\n* `fit(X, y)`\n* `predict_stage1(X)` – early-only predictions.\n* `predict(X)` – final predictions.\n\n### `StagecoachClassifier`\n\n```python\nStagecoachClassifier(\n    stage1_estimator,\n    stage2_estimator,\n    early_features,\n    late_features,\n    use_stage1_pred_as_feature=True,\n    inner_cv=None,\n    random_state=None,\n)\n```\n\n* Stage-1 classifier must implement `predict_proba` or `decision_function`.\n* Stage-2 classifier must implement `predict_proba`.\n* `predict_stage1_proba(X)` returns a provisional probability for the positive class\n  using early features only.\n* `predict_proba(X)` / `predict(X)` use both stages.\n\n---\n\n## Business-level use cases\n\n### 1. Ad serving \u0026 recommendation\n\n* **Stage 1 (trunk):**\n  user, session, page/context features. Run for every candidate to\n  do rough scoring / candidate pruning.\n* **Stage 2 (head):**\n  ad/creative/item-side features (embeddings, textual features, sponsorship info),\n  plus stage-1 scores. Run only on the smaller candidate set.\n\nThis lets you:\n\n* keep the expensive features and models off the critical path where possible,\n* cross-validate the *whole* two-stage scoring process as one estimator, and\n* reason explicitly about which features are actually available at each stage.\n\n### 2. 
Per-customer models with private fields\n\n* **Shared trunk:** trained on non-sensitive features across all customers.\n* **Per-customer head (stage 2):** trained only on that customer’s private fields\n  (GDPR-protected data, custom risk scores, internal labels) inside their environment.\n\nYou can:\n\n* ship the trunk once,\n* let each customer fit their own stage-2 model locally, and\n* still evaluate how “global trunk + local head” behaves on held-out data.\n\n### 3. Latency and staged inference\n\nIf your system has a front-door budget (say ~10 ms) and a back-end budget per\nselected candidate, StagecoachML gives you a clean way to:\n\n* do rough scoring at T₁ using a small, cheap stage-1 model;\n* hydrate more features or call heavier services; and\n* refine scores at T₂ with stage-2.\n\nBecause the whole pipeline is an sklearn estimator, you don’t have to guess whether\nthis staging actually helps: you can compare two-stage vs single-stage models on\nthe same train/test splits.\n\n---\n\n## Examples\n\nThe `examples/` directory contains runnable scripts:\n\n* `examples/regression_example.py`\n  Uses the diabetes dataset, splits features into early/late, trains a\n  `StagecoachRegressor`, and compares it to a one-stage baseline.\n\n* `examples/classification_example.py`\n  Uses the breast cancer dataset, trains a `StagecoachClassifier`, and compares\n  provisional and final predictions against a one-stage logistic baseline.\n\nRun them with:\n\n```bash\npython -m examples.regression_example\npython -m examples.classification_example\n```\n\n---\n\n## Design notes \u0026 non-goals\n\n* Treat `Stagecoach*` as **one model** for train/validation/test; don’t hand-tune\n  stages in isolation and then try to glue them together.\n* `inner_cv` is an optional extra for robustness, not a replacement for normal\n  cross-validation.\n* This library is *not* a general DAG/workflow engine.
If you want full pipeline\n  orchestration (scheduling, retries, monitoring, etc.), you probably want\n  Airflow/Prefect/etc. StagecoachML is about one very specific modeling pattern:\n  staged feature arrival.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Fstagecoachml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffinite-sample%2Fstagecoachml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Fstagecoachml/lists"}