{"id":13688924,"url":"https://github.com/vanderschaarlab/hyperimpute","last_synced_at":"2025-04-07T06:10:57.365Z","repository":{"id":37899684,"uuid":"438955479","full_name":"vanderschaarlab/hyperimpute","owner":"vanderschaarlab","description":"A framework for prototyping and benchmarking imputation methods","archived":false,"fork":false,"pushed_at":"2023-04-04T04:42:20.000Z","size":438,"stargazers_count":177,"open_issues_count":3,"forks_count":14,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-31T04:06:19.832Z","etag":null,"topics":["data-science","imputation","imputation-algorithm","machine-learning","machine-learning-prerequisites","preprocessing-data","python","scikit-learn"],"latest_commit_sha":null,"homepage":"https://www.vanderschaar-lab.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vanderschaarlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-16T10:52:06.000Z","updated_at":"2025-03-07T05:37:35.000Z","dependencies_parsed_at":"2024-09-25T00:09:50.518Z","dependency_job_id":"f5fd5135-4376-4a57-9558-d6dc2e83ec5e","html_url":"https://github.com/vanderschaarlab/hyperimpute","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanderschaarlab%2Fhyperimpute","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanderschaarlab%2Fhyperimpute/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanderschaarlab
%2Fhyperimpute/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanderschaarlab%2Fhyperimpute/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vanderschaarlab","download_url":"https://codeload.github.com/vanderschaarlab/hyperimpute/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601448,"owners_count":20964864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","imputation","imputation-algorithm","machine-learning","machine-learning-prerequisites","preprocessing-data","python","scikit-learn"],"created_at":"2024-08-02T15:01:27.711Z","updated_at":"2025-04-07T06:10:57.343Z","avatar_url":"https://github.com/vanderschaarlab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# HyperImpute - A library for NaNs and nulls.\n\n\u003cdiv align=\"center\"\u003e\n\n[![Test In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zGm4VeXsJ-0x6A5_icnknE7mbJ0knUig?usp=sharing)\n[![Tests PR](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_pr.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_pr.yml)\n[![Tests 
Full](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_full.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_full.yml)\n[![Tutorials](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_tutorials.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_tutorials.yml)\n[![Documentation Status](https://readthedocs.org/projects/hyperimpute/badge/?version=latest)](https://hyperimpute.readthedocs.io/en/latest/?badge=latest)\n\n\n[![arXiv](https://img.shields.io/badge/arXiv-2206.07769-b31b1b.svg)](https://arxiv.org/abs/2206.07769)\n[![](https://pepy.tech/badge/hyperimpute)](https://pypi.org/project/hyperimpute/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)\n[![slack](https://img.shields.io/badge/chat-on%20slack-purple?logo=slack)](https://join.slack.com/t/vanderschaarlab/shared_invite/zt-1pzy8z7ti-zVsUPHAKTgCd1UoY8XtTEw)\n\n\n![image](https://github.com/vanderschaarlab/hyperimpute/raw/main/docs/arch.png \"HyperImpute\")\n\n\u003c/div\u003e\n\n\nHyperImpute simplifies choosing a data imputation algorithm for your ML pipelines.\nIt includes various novel algorithms for missing data and is compatible with [sklearn](https://scikit-learn.org/stable/).\n\n\n## HyperImpute features\n- :rocket: Fast and extensible dataset imputation algorithms, compatible with sklearn.\n- :key: New iterative imputation method: HyperImpute.\n- :cyclone: Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.\n- :fire: Pluggable architecture.\n\n## :rocket: Installation\n\nThe library can be installed from PyPI using\n```bash\n$ pip install hyperimpute\n```\nor from source, using\n```bash\n$ pip install .\n```\n\n## :boom: Sample Usage\nList available 
imputers\n```python\nfrom hyperimpute.plugins.imputers import Imputers\n\nimputers = Imputers()\n\nimputers.list()\n```\nImpute a dataset using one of the available methods\n```python\nimport pandas as pd\nimport numpy as np\nfrom hyperimpute.plugins.imputers import Imputers\n\nX = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])\n\nmethod = \"gain\"\n\nplugin = Imputers().get(method)\nout = plugin.fit_transform(X.copy())\n\nprint(method, out)\n```\nSpecify the baseline models for HyperImpute\n```python\nimport pandas as pd\nimport numpy as np\nfrom hyperimpute.plugins.imputers import Imputers\n\nX = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])\n\nplugin = Imputers().get(\n    \"hyperimpute\",\n    optimizer=\"hyperband\",\n    classifier_seed=[\"logistic_regression\"],\n    regression_seed=[\"linear_regression\"],\n)\n\nout = plugin.fit_transform(X.copy())\nprint(out)\n```\nUse an imputer with a SKLearn pipeline\n```python\nimport pandas as pd\nimport numpy as np\n\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.ensemble import RandomForestRegressor\n\nfrom hyperimpute.plugins.imputers import Imputers\n\nX = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])\ny = pd.Series([1, 2, 1, 2])\n\nimputer = Imputers().get(\"hyperimpute\")\n\nestimator = Pipeline(\n    [\n        (\"imputer\", imputer),\n        (\"forest\", RandomForestRegressor(random_state=0, n_estimators=100)),\n    ]\n)\n\nestimator.fit(X, y)\n```\nWrite a new imputation plugin\n```python\nfrom sklearn.impute import KNNImputer\nfrom hyperimpute.plugins.imputers import Imputers, ImputerPlugin\n\nimputers = Imputers()\n\nknn_imputer = \"custom_knn\"\n\nclass KNN(ImputerPlugin):\n    def __init__(self) -\u003e None:\n        super().__init__()\n        self._model = KNNImputer(n_neighbors=2, weights=\"uniform\")\n\n    @staticmethod\n    def name():\n        return knn_imputer\n\n    @staticmethod\n    
def hyperparameter_space():\n        return []\n\n    def _fit(self, *args, **kwargs):\n        self._model.fit(*args, **kwargs)\n        return self\n\n    def _transform(self, *args, **kwargs):\n        return self._model.transform(*args, **kwargs)\n\nimputers.add(knn_imputer, KNN)\n\nassert imputers.get(knn_imputer) is not None\n```\nBenchmark imputation models on a dataset\n```python\nfrom sklearn.datasets import load_iris\nfrom hyperimpute.plugins.imputers import Imputers\nfrom hyperimpute.utils.benchmarks import compare_models\n\nX, y = load_iris(as_frame=True, return_X_y=True)\n\nimputer = Imputers().get(\"hyperimpute\")\n\ncompare_models(\n    name=\"example\",\n    evaluated_model=imputer,\n    X_raw=X,\n    ref_methods=[\"ice\", \"missforest\"],\n    scenarios=[\"MAR\"],\n    miss_pct=[0.1, 0.3],\n    n_iter=2,\n)\n```\n\n## 📓 Tutorials\n - [Tutorial 0: Imputation basics](tutorials/tutorial_00_imputer_plugins.ipynb)\n - [Tutorial 1: AutoML for imputation](tutorials/tutorial_01_bayesian_optimization_over_imputers.ipynb)\n - [Tutorial 2: Benchmark](tutorials/tutorial_02_benchmark_models.ipynb)\n\n## :zap: Imputation methods\nThe following table contains the default imputation plugins:\n\n| Strategy | Description| Code |\n|--- | --- | --- |\n|**HyperImpute**|Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets| [`plugin_hyperimpute.py`](src/hyperimpute/plugins/imputers/plugin_hyperimpute.py) |\n|**Mean**|Replace the missing values using the mean along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)| [`plugin_mean.py`](src/hyperimpute/plugins/imputers/plugin_mean.py) |\n|**Median**|Replace the missing values using the median along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) |  
[`plugin_median.py`](src/hyperimpute/plugins/imputers/plugin_median.py) |\n|**Most-frequent**|Replace the missing values using the most frequent value along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)|[`plugin_most_freq.py`](src/hyperimpute/plugins/imputers/plugin_most_freq.py) |\n|**MissForest**|Iterative imputation method based on Random Forests using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`ExtraTreesRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)| [`plugin_missforest.py`](src/hyperimpute/plugins/imputers/plugin_missforest.py) |\n|**ICE**| Iterative imputation method based on regularized linear regression using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`BayesianRidge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html)| [`plugin_ice.py`](src/hyperimpute/plugins/imputers/plugin_ice.py)|\n|**MICE**| Multiple imputations based on ICE using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`BayesianRidge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html)| [`plugin_mice.py`](src/hyperimpute/plugins/imputers/plugin_mice.py) |\n|**SoftImpute**|  [`Low-rank matrix approximation via nuclear-norm regularization`](https://jmlr.org/papers/volume16/hastie15a/hastie15a.pdf)| [`plugin_softimpute.py`](src/hyperimpute/plugins/imputers/plugin_softimpute.py)|\n|**EM**|Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization) - [`EM imputation 
algorithm`](https://joon3216.github.io/research_materials/2019/em_imputation.html)|[`plugin_em.py`](src/hyperimpute/plugins/imputers/plugin_em.py) |\n|**Sinkhorn**|[`Missing Data Imputation using Optimal Transport`](https://arxiv.org/pdf/2002.03860.pdf)|[`plugin_sinkhorn.py`](src/hyperimpute/plugins/imputers/plugin_sinkhorn.py) |\n|**GAIN**|[`GAIN: Missing Data Imputation using Generative Adversarial Nets`](https://arxiv.org/abs/1806.02920)|[`plugin_gain.py`](src/hyperimpute/plugins/imputers/plugin_gain.py) |\n|**MIRACLE**|[`MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms`](https://arxiv.org/abs/2111.03187)|[`plugin_miracle.py`](src/hyperimpute/plugins/imputers/plugin_miracle.py) |\n|**MIWAE**|[`MIWAE: Deep Generative Modelling and Imputation of Incomplete Data`](https://arxiv.org/abs/1812.02633)|[`plugin_miwae.py`](src/hyperimpute/plugins/imputers/plugin_miwae.py) |\n\n\n## :hammer: Tests\n\nInstall the testing dependencies using\n```bash\npip install .[testing]\n```\nThe tests can be executed using\n```bash\npytest -vsx\n```\n\n## Citing\n\nIf you use this code, please cite the associated paper:\n\n```\n@article{Jarrett2022HyperImpute,\n  doi = {10.48550/ARXIV.2206.07769},\n  url = {https://arxiv.org/abs/2206.07769},\n  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},\n  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},\n  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},\n  year = {2022},\n  booktitle={39th International Conference on Machine 
Learning},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvanderschaarlab%2Fhyperimpute","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvanderschaarlab%2Fhyperimpute","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvanderschaarlab%2Fhyperimpute/lists"}