{"id":50127638,"url":"https://github.com/sunnyadn/comprisk","last_synced_at":"2026-05-23T20:35:15.057Z","repository":{"id":354584891,"uuid":"1222696007","full_name":"sunnyadn/comprisk","owner":"sunnyadn","description":"Python toolkit for competing risks: forest (RSF) today; Fine-Gray + Aalen-Johansen + Gray's test + cause-specific Cox in v0.4. Scales to n=10⁶ in ~1 min, 10–22× faster than randomForestSRC on real EHR data, sklearn-compatible.","archived":false,"fork":false,"pushed_at":"2026-05-14T16:49:01.000Z","size":2641,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T20:35:05.450Z","etag":null,"topics":["biostatistics","competing-risks","machine-learning","numba","python","random-forest","random-survival-forest","scikit-learn","survival-analysis"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/comprisk/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sunnyadn.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-27T16:06:05.000Z","updated_at":"2026-05-14T16:49:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/sunnyadn/comprisk","commit_stats":null,"previous_names":["sunnyadn/crforest","sunnyadn/comprisk"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/sunnyadn/comprisk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunnyadn%2Fcomprisk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunnyadn%2Fcomprisk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunnyadn%2Fcomprisk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunnyadn%2Fcomprisk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sunnyadn","download_url":"https://codeload.github.com/sunnyadn/comprisk/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunnyadn%2Fcomprisk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33412082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T18:09:33.147Z","status":"ssl_error","status_checked_at":"2026-05-23T18:09:31.380Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biostatistics","competing-risks","machine-learning","numba","python","random-forest","random-survival-forest","scikit-learn","survival-analysis"],"created_at":"2026-05-23T20:35:14.134Z","updated_at":"2026-05-23T20:35:15.048Z","avatar_url":"https://github.com/sunnyadn.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# comprisk\n\n[![PyPI version](https://img.shields.io/pypi/v/comprisk.svg)](https://pypi.org/project/comprisk/)\n[![CI](https://github.com/sunnyadn/comprisk/actions/workflows/ci.yml/badge.svg)](https://github.com/sunnyadn/comprisk/actions/workflows/ci.yml)\n[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19876282-blue)](https://doi.org/10.5281/zenodo.19876282)\n\n**comprisk** — a Python toolkit for competing risks. Ships a scalable,\nscikit-learn-compatible competing-risks random survival forest plus the\nthree canonical regression / non-parametric methods clinical researchers\nactually need: Fine-Gray subdistribution-hazard regression, a stand-alone\nAalen-Johansen cumulative-incidence estimator with cmprsk-parity\nvariance, and cause-specific Cox PH (see [Roadmap](#roadmap)). Designed\nto remove the Python → R workflow split that applied researchers\ncurrently endure for competing-risks survival analysis.\n\n\u003e **Status: alpha.** API and internals may change before v1.0.\n\u003e **Renamed from `crforest` in 0.3.1** — `pip install comprisk`,\n\u003e `from comprisk import CompetingRiskForest`.\n\n## Highlights\n\n- **The four canonical CR methods, native Python.** `FineGrayRegression`\n  matches `R cmprsk::crr()` β̂ to floating-point noise (max |Δβ| = 1.4e-15\n  on three reference datasets); `robust_se=True` returns the Geskus\n  cluster sandwich agreeing with cmprsk's IPCW-corrected SE to ~3 digits.\n  `CumulativeIncidence` reproduces `cmprsk::cuminc()` to 1e-9 across CIF\n  and variance. `gray_test` reproduces `cmprsk::cuminc()$Tests` to 1e-14.\n  `CauseSpecificCox` matches `survival::coxph(method=\"breslow\")` to 1e-9.\n- **Only native-Python competing-risks RSF.** Cause-specific log-rank\n  splitting + composite CR log-rank, Aalen-Johansen CIF, Nelson-Aalen CHF,\n  Wolbers + Uno IPCW concordance, OOB Breiman VIMP, Ishwaran minimal-depth\n  variable selection, exact TreeSHAP.\n- **CR-aware model evaluation.** `score_cr` reports IPCW time-dependent\n  AUC and Brier score under competing risks, plus integrated AUC / Brier\n  (iAUC, IBS) with bootstrap CIs; `calibration_cr` returns tidy quantile-\n  decile calibration data with per-bin Wilson intervals — one-call\n  replacements for the CR-mode `riskRegression::Score()` / `plotCalibration()`\n  blocks, taking a dict of named candidate models.\n- **10–22× faster than [randomForestSRC](https://cran.r-project.org/package=randomForestSRC)**\n  on real EHR data (CHF 14–22×, SEER 11.6×; full tables in\n  [docs/benchmarks.md](docs/benchmarks.md)), with C ≈ 0.85 on both\n  libraries. ~95× faster than rfSRC built without OpenMP (default R-on-macOS).\n- **Order-of-magnitude faster than [scikit-survival](https://scikit-survival.readthedocs.io/)**\n  (16.6× at n = 5k, 544× at n = 50k), without disabling CIF/CHF outputs.\n- **Bit-identical to randomForestSRC** with `equivalence=\"rfsrc\"` —\n  reproduces the per-tree mtry/nsplit RNG stream for paper-grade\n  reproducibility, sensitivity checks, and rfSRC-baseline migrations.\n\n## comprisk vs alternatives\n\n|                                          | comprisk                       | randomForestSRC                    | scikit-survival          |\n|------------------------------------------|:------------------------------:|:----------------------------------:|:------------------------:|\n| Language                                 | Python                         | R                                  | Python                   |\n| Native competing risks                   | ✓                              | ✓                                  | ✗ (single-event only)    |\n| Aalen–Johansen CIF output                | ✓                              | ✓                                  | n/a                      |\n| Cumulative hazard at scale               | ✓                              | ✓                                  | ✗¹                       |\n| OOB permutation VIMP                     | ✓                              | ✓                                  | ✗                        |\n| Bit-identical reproducibility mode       | ✓ (`equivalence=\"rfsrc\"`)      | —                                  | n/a                      |\n| Scales to n = 10⁶                        | ✓ (63 s on i7)                 | memory-bound past n ≈ 500 000 on consumer hardware | ✗¹ / OOM²                |\n| Default parallelism                      | ✓ (`n_jobs=-1`)                | OpenMP (build-dependent; macOS Apple clang lacks it) | ✓        |\n| GPU preview                              | ✓ (CUDA 12)                    | ✗                                  | ✗                        |\n\n¹ sksurv `RandomSurvivalForest(low_memory=True)` is the only mode that\nscales beyond ~10k samples, but it disables `predict_cumulative_hazard_function`\nand `predict_survival_function` (raises `NotImplementedError`).\n² sksurv `low_memory=False` exposes CHF / survival outputs but stores per-leaf\nfull CHF arrays; peak RSS reaches 16.8 GB at n = 5k on synthetic, OOMs\n(\u003e 21.5 GB) at n = 10k on a 24 GB host.\n\n## Install\n\n```bash\npip install comprisk          # or:  uv add comprisk\npip install \"comprisk[gpu]\"   # or:  uv add 'comprisk[gpu]'\n```\n\nRequires Python ≥ 3.10. Core dependencies: numpy, scipy, pandas, joblib,\nnumba, scikit-learn. GPU extra adds cupy + CUDA 12 runtime libs (preview;\nfaster only at low feature count today, full rewrite scheduled for v1.1).\n\n## Quickstart\n\n```python\nimport numpy as np\nfrom comprisk import CompetingRiskForest\n\n# Toy competing-risks data: 500 subjects, 6 features, 2 causes (+ censoring).\nrng = np.random.default_rng(42)\nn = 500\nX = rng.normal(size=(n, 6))\ntime = rng.exponential(2.0, size=n) + 0.1\nevent = rng.choice([0, 1, 2], size=n, p=[0.4, 0.4, 0.2])  # 0 = censored\n\n# Fit. Defaults: n_estimators=100, max_features=\"sqrt\", logrankCR, n_jobs=-1.\nforest = CompetingRiskForest(n_estimators=100, random_state=42).fit(X, time, event)\n\n# Aalen-Johansen cumulative incidence over the forest's chosen time grid.\ncif = forest.predict_cif(X[:5])                       # (5, n_causes, n_times)\n\n# Cause-specific Wolbers concordance.\nprint(\"C-index, cause 1:\", forest.score(X, time, event, cause=1))\n```\n\n### Explainability and feature selection\n\n```python\n# OOB permutation importance (Uno IPCW-scored).\nvimp = forest.compute_importance(random_state=42)\n\n# Ishwaran minimal-depth variable selection.\nselected = forest.minimal_depth().query(\"selected\")[\"feature\"].tolist()\n\n# Exact TreeSHAP attributions (Lundberg 2018, Algorithm 2).\nshap, base = forest.shap_values(X[:10])               # (n, p, n_times, n_causes)\n```\n\n[`examples/shap_explain.py`](examples/shap_explain.py) is an interactive\n[marimo](https://marimo.io) notebook (a plain `.py` file) that walks through\nSHAP additivity, per-cause global importance, and per-subject attribution over\nthe time grid, with sliders for the forest size and the subject under\ninspection. Run it with `uv run --extra examples marimo edit examples/shap_explain.py`\n(or `uvx marimo edit --sandbox examples/shap_explain.py` to use the notebook's\nown PEP 723 dependency header).\n\n### Fine-Gray, Aalen-Johansen, Gray's test, and cause-specific Cox\n\n```python\nfrom comprisk import (\n    FineGrayRegression, CumulativeIncidence, CauseSpecificCox, gray_test,\n)\n\n# Fine-Gray subdistribution-hazard regression — matches R cmprsk::crr()\n# β̂ to floating-point noise. robust_se=True gives the Geskus cluster\n# sandwich (matches cmprsk's IPCW-corrected SE to ~3 digits).\nfg = FineGrayRegression(cause=1, robust_se=True).fit(X, time=time, event=event)\nprint(fg.coef_, fg.se_)\nF = fg.predict_cumulative_incidence(X[:5])            # (5, n_event_times)\n\n# Non-parametric Aalen-Johansen CIF (cmprsk::cuminc parity, optional groups).\nci = CumulativeIncidence().fit(time=time, event=event, group=group_var)\nest, var = ci.timepoints([1.0, 5.0, 10.0])            # (n_curves, 3)\n\n# Gray's K-sample test for CIFs — matches cmprsk::cuminc()$Tests to 1e-14.\nresult = gray_test(time, event, group_var, cause=1)\nprint(result.stat, result.pvalue, result.df)\n\n# Cause-specific Cox PH — competing events censored at t_j.\n# Matches survival::coxph(method=\"breslow\") to 1e-9.\ncs = CauseSpecificCox(cause=1).fit(X, time=time, event=event)\n```\n\nPenalized variable selection for the Fine-Gray model (LASSO / ridge /\nelastic-net / MCP / SCAD) — no equivalent elsewhere in Python:\n\n```python\nfrom comprisk import PenalizedFineGrayRegression\n\n# Cyclic coordinate descent on the IPCW-weighted partial likelihood,\n# warm-started along a 100-point lambda path. cv=K picks lambda by the\n# cross-validated partial-likelihood deviance; coefficients + sandwich SEs\n# match R crrp::crrp() (Fu et al. 2017) along the whole path to ~1e-6.\npen = PenalizedFineGrayRegression(penalty=\"lasso\", cv=5).fit(X, time=time, event=event)\nprint(pen.coef_, pen.lambda_min_, pen.lambda_1se_)\npen.coef_path_                                        # (p, n_lambda)\n```\n\nDetailed walkthroughs — additivity checks, global SHAP importance, sklearn-\ncompatible slicing, performance caveats, rfSRC threshold compatibility — in\n[docs/quickstart.md](docs/quickstart.md), which also covers data format,\nprediction shapes, cross-validation, GPU, and rfSRC migration.\n\n\u003e **scikit-learn drop-in.** `CompetingRiskForest` is a real sklearn\n\u003e estimator (`BaseEstimator`, `clone()`-friendly, picklable).\n\u003e `cross_val_score`, `KFold`, `Pipeline` work without a wrapper — pass\n\u003e `Surv.from_arrays(event, time)` as the `y` argument, or use the legacy\n\u003e 3-arg `fit(X, time, event)` form. Full example in\n\u003e [docs/quickstart.md § Cross-validation](docs/quickstart.md#cross-validation).\n\n## Roadmap\n\ncomprisk is intentionally CR-focused. For non-CR survival methods\n(general Cox PH, AFT, parametric, deep-survival, Kaplan-Meier as a\nstandalone API), use [lifelines](https://lifelines.readthedocs.io/) or\n[scikit-survival](https://scikit-survival.readthedocs.io/).\n\n| Version  | Module                                                | Status               |\n|----------|-------------------------------------------------------|----------------------|\n| v0.3     | `CompetingRiskForest` (CR-RSF)                        | Shipped              |\n| **v0.4** | `FineGrayRegression` (subdistribution hazard)         | Shipped              |\n| **v0.4** | `CumulativeIncidence` (stand-alone Aalen-Johansen)    | Shipped              |\n| **v0.4** | `gray_test` (Gray's K-sample log-rank)                | Shipped              |\n| **v0.4** | `CauseSpecificCox` (CR-aware censoring)               | Shipped              |\n| **v0.4** | `score_cr` / `calibration_cr` (CR-aware evaluation)   | Shipped              |\n| **v0.5** | `PenalizedFineGrayRegression` (LASSO/ridge/elastic-net/MCP/SCAD) | Shipped    |\n| v1.0     | API freeze + JMLR MLOSS submission                    | Planned              |\n| v1.1     | Full GPU rewrite                                      | Planned              |\n\n## Benchmarks\n\nHeadline numbers — full tables, methodology, and reproducibility scripts\nin [docs/benchmarks.md](docs/benchmarks.md).\n\n**vs randomForestSRC, matched-pair on real EHR data:**\n\n| Cohort | n × p | Hardware | comprisk | rfSRC OMP-on | Speedup |\n|---|---|---|---|---|---|\n| CHF (cardio) | 75k × 58 | Apple M4 / i7-14700K / HPC | 5.6–9.4 s | 84.8–207.3 s | **14–22×** |\n| SEER breast (oncology) | 238k × 17 | HPC Xeon Gold 6148 | 7.0 s | 81.6 s | **11.6×** |\n\nBoth libraries fit similarly well at every tested workload (HF /\ncancer-specific C ≈ 0.85). The 10–22× cross-dataset band tracks feature\ncount: rfSRC's per-split exhaustive scan scales with p, so the gap\nnarrows on lower-p cohorts. ~95× speedup vs rfSRC built without OpenMP\n(default R-on-macOS install).\n\n**vs scikit-survival, paired on i7-14700K** — synthetic 2-cause Weibull,\np = 58, both libraries at their best config:\n\n| n | sksurv `low_memory=True` | comprisk | speedup |\n|---|---|---|---|\n| 5 000 | 18.2 s | 1.10 s | **16.6×** |\n| 50 000 | 2935 s (49 min) | 5.40 s | **544×** |\n\nThe gap widens super-linearly (sksurv ≈ n^2.2; comprisk ≈ n^0.7).\ncomprisk also provides Aalen-Johansen CIF + Nelson-Aalen CHF that\nsksurv `low_memory=True` raises `NotImplementedError` for.\n\n**Scaling on a consumer desktop:** n = 10⁶ in **63 s** on i7-14700K,\n14.5 GB RSS. Reproducible via\n[`validation/spikes/lambda/exp5_paper_scale_bench.py`](validation/spikes/lambda/exp5_paper_scale_bench.py).\n\n## API\n\nFull parameter list in [`src/comprisk/forest.py`](src/comprisk/forest.py);\nusage by task in [docs/quickstart.md](docs/quickstart.md). Two splitrules\nare available: `logrankCR` (composite competing-risks log-rank, default)\nand `logrank` (cause-specific).\n\n## Documentation\n\n- [Quickstart](docs/quickstart.md) — common tasks with runnable code\n- [PRD](docs/prd.md) — what comprisk aims to be at v1.0\n- [Equivalence vs rfSRC](docs/equivalence-vs-rfsrc.md) — cross-library validation methodology\n- [References](docs/REFERENCES.md) — algorithmic provenance (Park-Miller, Bays-Durham, Wolbers 2009, Uno 2011, Cole \u0026 Hernán 2008, Breiman 2001, Ishwaran 2008/2014, etc.)\n\n## Development\n\nRequires [`uv`](https://docs.astral.sh/uv/).\n\n```bash\nuv venv\nuv pip install -e \".[dev]\"\nuv run pre-commit install\nuv run pytest\nuv run ruff check .\nuv run ruff format --check .\n```\n\n## License\n\nApache-2.0. See [LICENSE](LICENSE) and [NOTICE](NOTICE).\n\n## Citation\n\n```bibtex\n@software{yang_comprisk_2026,\n  author    = {Yang, Sunny and Zhao, Wanqi},\n  title     = {{comprisk: a Python toolkit for competing risks}},\n  year      = {2026},\n  publisher = {Zenodo},\n  version   = {0.3.1},\n  doi       = {10.5281/zenodo.19876282},\n  url       = {https://doi.org/10.5281/zenodo.19876282},\n}\n```\n\nDOI is concept-level (always resolves to the latest version). GitHub's\n\"Cite this repository\" button generates a version-specific record from\n[`CITATION.cff`](CITATION.cff). Algorithmic references in\n[`docs/REFERENCES.md`](docs/REFERENCES.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsunnyadn%2Fcomprisk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsunnyadn%2Fcomprisk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsunnyadn%2Fcomprisk/lists"}