{"id":34061390,"url":"https://github.com/calvinmccarter/kditransform","last_synced_at":"2026-04-07T15:31:13.100Z","repository":{"id":195903480,"uuid":"671575146","full_name":"calvinmccarter/kditransform","owner":"calvinmccarter","description":"Kernel density integral transformation: feature preprocessing and univariate clustering (TMLR, 2023)","archived":false,"fork":false,"pushed_at":"2025-10-23T17:20:57.000Z","size":16127,"stargazers_count":9,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-02T11:16:59.508Z","etag":null,"topics":["data-science","discretization","kernel-density-estimation","preprocessing","python","quantiles"],"latest_commit_sha":null,"homepage":"https://openreview.net/pdf?id=6OEcDKZj5j","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/calvinmccarter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-07-27T16:22:08.000Z","updated_at":"2025-10-23T17:20:10.000Z","dependencies_parsed_at":"2023-09-20T04:10:26.601Z","dependency_job_id":"a24b9682-e835-4944-9d9f-37f08c3c622c","html_url":"https://github.com/calvinmccarter/kditransform","commit_stats":{"total_commits":5,"total_committers":1,"mean_commits":5.0,"dds":0.0,"last_synced_commit":"308ba779bbc0b26c3dae15889065f37158f60117"},"previous_names":["calvinmccarter/kditransform"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/calvinmccarter/kditransform",
"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/calvinmccarter%2Fkditransform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/calvinmccarter%2Fkditransform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/calvinmccarter%2Fkditransform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/calvinmccarter%2Fkditransform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/calvinmccarter","download_url":"https://codeload.github.com/calvinmccarter/kditransform/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/calvinmccarter%2Fkditransform/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31518398,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T03:10:19.677Z","status":"ssl_error","status_checked_at":"2026-04-07T03:10:13.982Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","discretization","kernel-density-estimation","preprocessing","python","quantiles"],"created_at":"2025-12-14T04:52:29.657Z","updated_at":"2026-04-07T15:31:13.094Z","avatar_url":"https://github.com/calvinmccarter.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# kditransform\n\n[![PyPI version](https://badge.fury.io/py/kditransform.svg)](https://badge.fury.io/py/kditransform)\n[![Downloads](https://pepy.tech/badge/kditransform)](https://pepy.tech/project/kditransform)\n\nThe kernel-density integral transformation [(McCarter, 2023, TMLR)](https://openreview.net/pdf?id=6OEcDKZj5j), like [min-max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) and [quantile transformation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), maps continuous features to the range `[0, 1]`.\nIt achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.\nIt can also be used to discretize features, offering a data-driven alternative to univariate clustering or [K-bins discretization](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-discretization).\n\nYou can tune the interpolation $\\alpha$ between 0 (quantile transform) and $\\infty$ (min-max transform), but a good default is $\\alpha=1$, 
which is equivalent to using `scipy.stats.gaussian_kde(bw_method=1)`. This is an easy way to improve performance on many supervised learning problems. See [this notebook](https://github.com/calvinmccarter/kditransform/blob/master/examples/regression-plots.ipynb) for example usage and the [paper](https://openreview.net/pdf?id=6OEcDKZj5j) for a detailed description of the method.\n\n\u003cfigure\u003e\n  \u003cfigcaption\u003e\u003ci\u003eAccuracy on Iris\u003c/i\u003e\u003c/figcaption\u003e\n  \u003cimg src=\"examples/Accuracy-vs-bwf-iris-pca.jpg\" alt=\"drawing\" width=\"300\"/\u003e\n\u003c/figure\u003e\n\u003cfigure\u003e\n  \u003cfigcaption\u003e\u003ci\u003erMSE on CA Housing\u003c/i\u003e\u003c/figcaption\u003e\n  \u003cimg src=\"examples/MSE-vs-bwf-cahousing-linr-nolegend.jpg\" alt=\"drawing\" width=\"300\"/\u003e\n\u003c/figure\u003e\n    \n\n## Installation \n\n### Installation from PyPI\n```\npip install kditransform\n```\n\n### Installation from source\nAfter cloning this repo, install the dependencies on the command line, then install kditransform and run the tests:\n```\npip install -r requirements.txt\npip install -e .\npytest\n```\n\n## Usage\n\n`kditransform.KDITransformer` is a drop-in replacement for [sklearn.preprocessing.QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html). When `alpha` (defaults to 1.0) is small, our method behaves like the QuantileTransformer; when `alpha` is large, it behaves like [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).\n\nTo produce features that are roughly scaled like z-scores as in `StandardScaler`, use `KDITransformer(output_distribution='normal')`. 
This applies the standard normal inverse CDF transform after the KDI transform.\n\n```\nimport numpy as np\nfrom kditransform import KDITransformer\nX = np.random.uniform(size=(500, 1))\nkdt = KDITransformer(alpha=1.)\nY = kdt.fit_transform(X)\n```\n\n`kditransform.KDIDiscretizer` offers an API based on [sklearn.preprocessing.KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html). It encodes each feature ordinally, similarly to `KBinsDiscretizer(encode='ordinal')`.\n\n```\nfrom kditransform import KDIDiscretizer\nN = 1000  # total sample size; illustrative value, not specified in the original\nrng = np.random.default_rng(1)\nx1 = rng.normal(1, 0.75, size=int(0.55*N))\nx2 = rng.normal(4, 1, size=int(0.3*N))\nx3 = rng.uniform(0, 20, size=int(0.15*N))\nX = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)\nkdd = KDIDiscretizer()\nT = kdd.fit_transform(X)\n```\n\nWhen initialized as `KDIDiscretizer(enable_predict_proba=True)`, the discretizer can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.\n\n```\nkdd = KDIDiscretizer(enable_predict_proba=True).fit(X)\nP = kdd.predict(X)  # one-hot encoding\nP = kdd.predict_proba(X)  # probabilistic one-hot encoding\n```\n\n## Citing this method\n\nIf you use this tool, please cite KDITransform\nusing the following reference to our [TMLR paper](https://openreview.net/pdf?id=6OEcDKZj5j):\n\nIn Bibtex format:\n\n```bibtex\n@article{\nmccarter2023the,\ntitle={The Kernel Density Integral Transformation},\nauthor={Calvin McCarter},\njournal={Transactions on Machine Learning Research},\nissn={2835-8856},\nyear={2023},\nurl={https://openreview.net/forum?id=6OEcDKZj5j},\nnote={}\n}\n```\n\n## Usage with TabPFN\n\n[TabPFN](https://arxiv.org/abs/2207.01848) is a meta-learned Transformer model for tabular classification. In the TabPFN paper, features are preprocessed with the concatenation of z-scored \u0026 power-transformed features. 
After simply [adding KDITransform'ed features](https://github.com/calvinmccarter/TabPFN/commit/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d#diff-6e18bf62a38856a86e8846cefd2d9fd323dc178c161d4e63d23bf613dc6de654), I observed [improvements](https://github.com/calvinmccarter/TabPFN/blob/e51e6621e2f1820d5646b14640fcfb9ef13f3c2d/replicate-kditransform.ipynb) on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcalvinmccarter%2Fkditransform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcalvinmccarter%2Fkditransform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcalvinmccarter%2Fkditransform/lists"}